Add remote Spark test orchestration framework #211

Closed
sdebruyn wants to merge 2 commits into main from feature/spark-remote-test-orchestration

Conversation

@sdebruyn
Owner

Summary

  • Adds a native --remote pytest flag that transparently delegates FabricSpark test execution to a Spark Job Definition on Fabric infrastructure
  • Adds Mode B (mounted lakehouse) support that redirects test artifacts to OneLake-accessible paths
  • The developer experience is simply pytest --de --remote -k "TestFoo" — no separate CLI needed

How it works

  1. Local pytest collects tests normally
  2. pytest_runtestloop hook intercepts when --remote is set
  3. Syncs project to lakehouse via azcopy
  4. Triggers a Spark Job Definition with forwarded pytest args + --junitxml
  5. Polls for completion, downloads results
  6. Parses junitxml and reports results back to local pytest session
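
Steps 3 to 5 above can be sketched as a small driver function. This is only an illustration under stated assumptions: the four callables (`sync_project`, `submit_job`, `get_status`, `download_results`) and the status strings `"Completed"`, `"Failed"`, and `"Cancelled"` are hypothetical stand-ins, not the names or values used by the PR or by the Fabric API.

```python
import time
from typing import Callable, Sequence


def run_remote(
    pytest_args: Sequence[str],
    sync_project: Callable[[], None],
    submit_job: Callable[[Sequence[str]], str],
    get_status: Callable[[str], str],
    download_results: Callable[[], str],
    poll_interval: float = 0.0,
    max_polls: int = 100,
) -> str:
    """Run sync, submit, and poll with injected callables; return the junitxml text."""
    sync_project()                     # step 3: upload the project
    job_id = submit_job(pytest_args)   # step 4: trigger the job with forwarded args
    for _ in range(max_polls):         # step 5: poll until a terminal status
        status = get_status(job_id)
        if status == "Completed":      # hypothetical status value
            return download_results()
        if status in ("Failed", "Cancelled"):  # hypothetical status values
            raise RuntimeError(f"remote job {job_id} ended with status {status}")
        time.sleep(poll_interval)
    raise TimeoutError(f"remote job {job_id} did not finish after {max_polls} polls")
```

Injecting the steps as callables keeps the control flow testable without a Fabric workspace; a real hook would presumably pass in wrappers around azcopy and the REST client.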

New files

  • tests/spark_remote/conftest_plugin.py — pytest_runtestloop hook implementation
  • tests/spark_remote/result_reporter.py — junitxml → TestReport mapping
  • tests/spark_remote/orchestrator.py — coordinates sync + job client
  • tests/spark_remote/spark_job_client.py — Fabric REST API wrapper
  • tests/spark_remote/sync.py — azcopy sync wrapper
  • tests/spark_remote/spark_entry_point.py — runs inside Spark job
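
As a rough illustration of the junitxml parsing that result_reporter.py is described as doing, the sketch below reads `<testcase>` elements with the standard library and classifies each one. The function name `parse_junitxml` and the tuple layout are assumptions made for this example; the PR's actual mapping to pytest TestReport objects is not shown on this page.

```python
import xml.etree.ElementTree as ET


def parse_junitxml(xml_text: str) -> list[tuple[str, str, str]]:
    """Return (classname, name, outcome) for every <testcase> element."""
    root = ET.fromstring(xml_text)
    results = []
    # iter() finds testcases whether the root is <testsuites> or <testsuite>.
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            outcome = "failed"
        elif case.find("skipped") is not None:
            outcome = "skipped"
        else:
            outcome = "passed"
        results.append((case.get("classname", ""), case.get("name", ""), outcome))
    return results
```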

Test plan

  • Verify pytest --de still works without --remote (no regression)
  • Verify --remote without --de exits with error
  • Test Mode B dry-run with FABRIC_TEST_SPARK_EXEC_MODE=mounted
  • End-to-end test with real Fabric workspace

Closes #207

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 16, 2026 17:12

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d8fd23fe1b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/spark_remote/orchestrator.py Outdated
Comment on lines +46 to +47
if not workspace_id and not workspace_name:
    raise ValueError("FABRIC_TEST_WORKSPACE_ID or FABRIC_TEST_WORKSPACE_NAME must be set")
[P1] Require both workspace name and ID for orchestration

from_env() currently accepts either FABRIC_TEST_WORKSPACE_ID or FABRIC_TEST_WORKSPACE_NAME, but downstream code needs both: ProjectSync builds OneLake URLs from workspace_name while SparkJobClient and job URLs use workspace_id. If users set only one variable (which this validation allows), remote execution fails later with malformed sync or API paths instead of a clear configuration error.
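
One way to fail fast with a clear configuration error is to collect every missing variable before raising. The sketch below assumes that all four variables read by from_env() are required; that set, and the helper name `load_required`, are assumptions for illustration rather than the PR's actual validation.

```python
from typing import Mapping

# Assumed required set, taken from the names that from_env() reads.
REQUIRED_VARS = (
    "FABRIC_TEST_WORKSPACE_ID",
    "FABRIC_TEST_WORKSPACE_NAME",
    "FABRIC_TEST_LAKEHOUSE_NAME",
    "FABRIC_TEST_REMOTE_LAKEHOUSE_ID",
)


def load_required(env: Mapping[str, str]) -> dict[str, str]:
    """Return the required settings, or raise one error naming all missing ones."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing required environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Passing `os.environ` as `env` would check the real environment; taking a mapping keeps the check easy to unit test.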


Comment thread tests/spark_remote/spark_job_client.py Outdated
path = (
    f"/v1/workspaces/{self._workspace_id}/items/{item_id}/jobs/instances?jobType=sparkjob"
)
body = {"executionData": {"commandLineArguments": " ".join(command_line_args)}}
[P2] Preserve quoting when forwarding remote pytest args

Building commandLineArguments with ' '.join(command_line_args) drops shell quoting, so arguments that contain spaces (for example -k "foo and bar") are split into multiple tokens in the remote job and change test selection semantics. This makes remote runs execute the wrong tests for common -k expressions and can produce mismatched local vs remote results.
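
The difference is easy to reproduce with the standard library. The sketch below compares a plain space join with `shlex.join` and re-tokenizes both with `shlex.split`. It assumes the remote side splits the argument string with shell-like rules, which this page does not confirm.

```python
import shlex

args = ["-k", "foo and bar", "--de"]

naive = " ".join(args)     # quoting is lost: -k foo and bar --de
quoted = shlex.join(args)  # quoting is kept: -k 'foo and bar' --de

# Re-tokenizing shows what a shell-like parser would hand to pytest.
naive_tokens = shlex.split(naive)    # five tokens, the -k expression is split
quoted_tokens = shlex.split(quoted)  # three tokens, identical to args
```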



Copilot AI left a comment

Pull request overview

Adds a pytest-native --remote execution path intended to run FabricSpark integration tests as a Fabric Spark Job Definition, syncing the repo to OneLake and replaying remote junitxml results back into the local pytest session.

Changes:

  • Introduces pytest_runtestloop interception in tests/conftest.py plus remote/mounted artifact path redirection for project_root, profiles_root, and logs_dir.
  • Adds a new tests/spark_remote/ framework: OneLake sync (azcopy), Fabric Spark Job Definition client (REST), Spark entry point, and junitxml result mapping back to pytest reports.
  • Updates .gitignore and test.env.sample for generated remote-run files and configuration.

Reviewed changes

Copilot reviewed 8 out of 10 changed files in this pull request and generated 10 comments.

File | Description
tests/spark_remote/sync.py | Implements azcopy-based project upload and artifact download, plus prerequisite checks.
tests/spark_remote/spark_job_client.py | Fabric REST wrapper to list/create/run/poll Spark Job Definitions.
tests/spark_remote/spark_entry_point.py | Spark-side runner that installs deps and executes pytest, writing junitxml artifacts.
tests/spark_remote/result_reporter.py | Parses junitxml and emits pytest TestReports to mirror remote results locally.
tests/spark_remote/orchestrator.py | Coordinates prerequisite checks, sync, Spark job submission, and artifact download.
tests/spark_remote/conftest_plugin.py | Implements the remote pytest_runtestloop handler and argument forwarding.
tests/spark_remote/__init__.py | Initializes the new tests.spark_remote package.
tests/conftest.py | Registers --remote, adds runtestloop delegation, and redirects artifact/log paths for remote/mounted modes.
test.env.sample | Documents optional env vars for remote/mounted Spark test execution.
.gitignore | Ignores generated remote files and artifact/result directories.

Comment thread tests/spark_remote/sync.py Outdated
Comment on lines +117 to +120
result = subprocess.run(["az", "account", "show"], capture_output=True, text=True)
if result.returncode != 0:
    errors.append("Azure CLI not logged in. Run: az login")
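
If the executable is not installed at all, `subprocess.run` raises `FileNotFoundError` before any return code exists. A minimal sketch of handling both cases follows; the helper name `check_cli_login` and the message wording are assumptions for illustration, not the PR's check_prerequisites() code.

```python
import subprocess


def check_cli_login(command: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the check passed."""
    errors: list[str] = []
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        # The binary itself is missing, so there is no exit code to inspect.
        errors.append(f"{command[0]} not found on PATH. Install it first.")
        return errors
    if result.returncode != 0:
        errors.append(f"{command[0]} is not logged in. Run: {command[0]} login")
    return errors
```

Calling `check_cli_login(["az", "account", "show"])` would mirror the snippet above while also covering a machine without the Azure CLI.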

Comment on lines +40 to +59
workspace_id = os.environ.get("FABRIC_TEST_WORKSPACE_ID", "")
workspace_name = os.environ.get("FABRIC_TEST_WORKSPACE_NAME", "")
lakehouse_name = os.environ.get("FABRIC_TEST_LAKEHOUSE_NAME", "")
lakehouse_id = os.environ.get("FABRIC_TEST_REMOTE_LAKEHOUSE_ID", "")
job_name = os.environ.get("FABRIC_TEST_SPARK_JOB_NAME", "dbt-fabric-tests")

if not workspace_id and not workspace_name:
    raise ValueError("FABRIC_TEST_WORKSPACE_ID or FABRIC_TEST_WORKSPACE_NAME must be set")
if not lakehouse_name:
    raise ValueError("FABRIC_TEST_LAKEHOUSE_NAME must be set")

project_root = Path(__file__).resolve().parent.parent.parent
return cls(
    workspace_id=workspace_id,
    workspace_name=workspace_name,
    lakehouse_name=lakehouse_name,
    lakehouse_id=lakehouse_id,
    project_root=project_root,
    job_name=job_name,
)
Comment on lines +48 to +56
def list_spark_job_definitions(self) -> list[dict]:
    items = []
    path = f"/v1/workspaces/{self._workspace_id}/sparkJobDefinitions"
    while path:
        resp = self._get(path)
        data = resp.json()
        items.extend(data.get("value", []))
        path = data.get("continuationUri")
    return items
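
A pagination loop like this one has to cope with a `continuationUri` that may be an absolute URL rather than a path. The sketch below shows one way to handle that, assuming a placeholder base URL and an injected `fetch_json` callable; `API_BASE`, `resolve_url`, and `list_all` are hypothetical names, not the client's real helpers.

```python
from typing import Callable, Optional

API_BASE = "https://api.example.invalid"  # placeholder, not the real API base URL


def resolve_url(path_or_url: str, base: str = API_BASE) -> str:
    """Leave absolute URLs untouched; prefix relative paths with the base."""
    if path_or_url.startswith(("http://", "https://")):
        return path_or_url
    return base + path_or_url


def list_all(first_path: str, fetch_json: Callable[[str], dict]) -> list[dict]:
    """Follow continuationUri links until a page has none."""
    items: list[dict] = []
    url: Optional[str] = resolve_url(first_path)
    while url:
        data = fetch_json(url)
        items.extend(data.get("value", []))
        next_link = data.get("continuationUri")
        url = resolve_url(next_link) if next_link else None
    return items
```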
Comment on lines +94 to +105
def run_on_demand(self, item_id: str, command_line_args: list[str]) -> tuple[str, str]:
    path = (
        f"/v1/workspaces/{self._workspace_id}/items/{item_id}/jobs/instances?jobType=sparkjob"
    )
    body = {"executionData": {"commandLineArguments": " ".join(command_line_args)}}
    resp = self._post(path, json_data=body)

    location = resp.headers.get("Location", "")
    # Location header format: .../items/{item_id}/jobs/instances/{job_instance_id}
    parts = location.rstrip("/").split("/")
    job_instance_id = parts[-1] if parts else ""
    return item_id, job_instance_id
Comment thread tests/spark_remote/conftest_plugin.py Outdated
Comment on lines +32 to +44
_FORWARDED_OPTIONS = (
    "-k",
    "-v",
    "--verbose",
    "-s",
    "--capture",
    "--with-grants",
    "--with-python",
    "--de",
    "--dw",
    "-x",
    "--exitfirst",
)
Comment on lines +47 to +68
def _build_remote_args(session: pytest.Session) -> list[str]:
    args: list[str] = []
    config = session.config

    if config.getoption("-k"):
        args.extend(["-k", config.getoption("-k")])

    if config.getoption("verbose", 0) > 0:
        args.append("-" + "v" * config.getoption("verbose"))

    if config.getoption("--de", default=False):
        args.append("--de")

    if config.getoption("--dw", default=False):
        args.append("--dw")

    if config.getoption("--with-grants", default=False):
        args.append("--with-grants")

    if config.getoption("--with-python", default=False):
        args.append("--with-python")

Comment on lines +69 to +74
if config.getoption("-x", default=False):
    args.append("-x")

args.append("--junitxml=/lakehouse/default/Files/dbt-test-artifacts/results.xml")

return args
Comment on lines +15 to +45
LAKEHOUSE_ROOT = "/lakehouse/default"
PROJECT_DIR = f"{LAKEHOUSE_ROOT}/Files/dbt-fabric-tests"
ARTIFACTS_DIR = f"{LAKEHOUSE_ROOT}/Files/dbt-test-artifacts"


def main() -> None:
    os.makedirs(ARTIFACTS_DIR, exist_ok=True)

    requirements_file = f"{PROJECT_DIR}/requirements-remote.txt"
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "-r", requirements_file, "--quiet"],
    )

    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "-e", PROJECT_DIR, "--no-deps", "--quiet"],
    )

    env_file = f"{PROJECT_DIR}/test.env.remote"
    if os.path.exists(env_file):
        from dotenv import load_dotenv

        load_dotenv(env_file, override=True)

    os.environ["FABRIC_TEST_SPARK_EXEC_MODE"] = "remote"

    pytest_args = sys.argv[1:]

    has_junitxml = any(arg.startswith("--junitxml") for arg in pytest_args)
    if not has_junitxml:
        pytest_args.extend(["--junitxml", f"{ARTIFACTS_DIR}/results.xml"])

Comment thread tests/conftest.py Outdated
elif mode == "mounted":
    path = os.getenv("FABRIC_TEST_ONELAKE_PATH")
    if not path:
        raise ValueError("FABRIC_TEST_ONELAKE_PATH required when SPARK_EXEC_MODE=mounted")
Comment on lines +214 to +219
if file_attr:
    file_module = file_attr.replace("/", ".").removesuffix(".py")
    remaining = classname.removeprefix(file_module + ".")
    if remaining and remaining != classname:
        return f"{file_attr}::{remaining.replace('.', '::')}::{name}"
    return f"{file_attr}::{name}"
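
To make the nodeid reconstruction concrete, here is a self-contained version of the same logic. The function name `reconstruct_nodeid` and the fallback for a missing `file` attribute are assumptions added for illustration; only the `if file_attr:` branch comes from the snippet above. `str.removeprefix` and `str.removesuffix` require Python 3.9 or later.

```python
def reconstruct_nodeid(file_attr: str, classname: str, name: str) -> str:
    """Rebuild a pytest-style nodeid from junitxml testcase attributes."""
    if file_attr:
        # "tests/functional/test_foo.py" becomes "tests.functional.test_foo"
        file_module = file_attr.replace("/", ".").removesuffix(".py")
        # Whatever follows the module prefix is the (possibly nested) class path.
        remaining = classname.removeprefix(file_module + ".")
        if remaining and remaining != classname:
            return f"{file_attr}::{remaining.replace('.', '::')}::{name}"
        return f"{file_attr}::{name}"
    # Hypothetical fallback when the XML carries no file attribute.
    return f"{classname}::{name}" if classname else name
```

For example, a testcase with file `tests/functional/test_foo.py`, classname `tests.functional.test_foo.TestFoo`, and name `test_bar` maps to `tests/functional/test_foo.py::TestFoo::test_bar`, while a module-level test whose classname equals the module path maps to `tests/functional/test_foo.py::test_bar`.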
Introduces a pytest-native `--remote` flag that transparently delegates
FabricSpark test execution to a Spark Job Definition on Fabric. The
developer runs `pytest --de --remote -k "TestFoo"` and gets normal
pytest output — the plugin handles sync, job submission, polling, and
result reporting via junitxml.

Also adds Mode B (mounted lakehouse) support via FABRIC_TEST_SPARK_EXEC_MODE
env var, which redirects test artifacts to OneLake-accessible paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sdebruyn sdebruyn force-pushed the feature/spark-remote-test-orchestration branch from d8fd23f to 165b00c Compare May 16, 2026 17:17
- Require all env vars (workspace_id, workspace_name, lakehouse_id)
- Use shlex.join for proper arg quoting in Spark job submission
- Handle FileNotFoundError when checking for az CLI
- Support absolute continuationUri URLs in pagination
- Remove unused _FORWARDED_OPTIONS constant
- Fix error message to reference correct env var name
- Add unit tests for junitxml parsing and nodeid reconstruction

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Owner Author

@sdebruyn sdebruyn left a comment

Addressed all review comments in commit 2d0af9c:

Fixed (7 items):

  1. Require both workspace ID and name (Codex P1 + Copilot) — Validation now requires all four env vars with a clear error listing missing ones.
  2. shlex.join for arg quoting (Codex P2 + Copilot) — run_on_demand() now uses shlex.join() to preserve quoting for -k expressions with spaces.
  3. FileNotFoundError for az CLI (Copilot) — check_prerequisites() now wraps subprocess.run in try/except and returns a clear install message.
  4. Pagination absolute URLs (Copilot) — _get() and _post() now detect absolute URLs and skip prepending FABRIC_API_BASE.
  5. Unused _FORWARDED_OPTIONS (Copilot) — Removed the dead constant.
  6. Error message typo (Copilot) — Now correctly says FABRIC_TEST_SPARK_EXEC_MODE=mounted.
  7. Unit tests for nodeid reconstruction (Copilot) — Added 11 tests in tests/unit/test_result_reporter.py.

Not changing (3 items):

  • Limited arg forwarding — Intentional. Remote Spark execution has different constraints; the explicit allowlist ensures only tested, compatible options are forwarded.
  • Hardcoded junitxml path (2 comments) — This is single-user test infrastructure. The azcopy sync with --delete-destination=true ensures clean state before each run. Per-run UUIDs would complicate download/cleanup without benefit.

@cloudflare-workers-and-pages

Deploying dbt-fabric with Cloudflare Pages

Latest commit: 2d0af9c
Status: ⚡️ Build in progress...

@sdebruyn
Owner Author

Duplicate of #208 — all commits (including review fixes) have been moved to #208.

@sdebruyn sdebruyn closed this May 16, 2026
@sdebruyn sdebruyn deleted the feature/spark-remote-test-orchestration branch May 16, 2026 18:50
