Generic sdist-build hooks in conda; port prebuilt sdist/git support#92
Conversation
51ecec2 to
7b0030d
Compare
Canon PR Review — Engineering Standards ScorecardPR: Make conda prebuilt-agnostic; restore deepspeed/transnetv2 sdist support (netflixext 1.3.11→1.3.12, prebuilt 0.2.1→0.3.0) Scores
Decision: APPROVE Violations FoundReliabilityCC-058 — Explicit Timeouts on All Network/Subprocess Calls (Source: Release It!, DDIA)
MaintainabilityCC-093 — Fix Broken Windows: TODOs Without Issue Tracker References (Source: The Pragmatic Programmer)
What's Working Well
Scored against: CC-003, CC-005, CC-006, CC-011, CC-021 (Code Quality) | CC-027, CC-015, CC-028, CC-017 (Module Design) | CC-014, CC-048, CC-056, CC-058, CC-060 (Reliability) | CC-019, CC-031, CC-093, CC-107 (Maintainability) |
…rity/robustness) - python_requires>=3.10 (blocking): add a setup.py comment (both packages) that the floor mirrors the internal minimum (<3.10 dropped internally); no 3.10-only construct in the code — _parse_build_system_requires has tomllib/tomli/regex. - Pass B shared overlay: _overlay_install now warns when a build dep is re-staged at a DIFFERENT version (surfaces the documented shared-overlay conflict instead of a silent miscompilation). - deferred_sdists: comment that "url" is the guaranteed source (is_downloadable_url filter) and "filename" is only a best-effort hint (harmless if None). - _gather_embedded_wheels Case D/E: comment that p.url is a placeholder (is_real_url=False + cached wheel => p.url is never fetched). - Pass B overlay cleanup: comment that the on-failure rmtree + the finally rmtree double-remove is intentional and safe (ignore_errors=True). No functional change to validated paths (comments + one diagnostic warning). 77 prebuilt unit tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Canon PR Review — Engineering Standards ScorecardPR: Generic sdist-build hooks in conda; port prebuilt sdist/git support Scores
Decision: APPROVE Violations FoundCode QualityCC-021 — Long Method (Source: Refactoring, Clean Code)
CC-006 — Minor knowledge duplication (Source: The Pragmatic Programmer)
ReliabilityCC-058 — Missing explicit timeouts on subprocess calls (Source: Release It!)
What's Working Well
To UnblockNothing blocking — this is an APPROVE. The two suggestions above (extract a per-sdist helper from Scored against: CC-003, CC-005, CC-006, CC-011, CC-021 (Code Quality) | CC-027, CC-015, CC-028, CC-017 (Module Design) | CC-014, CC-048, CC-056, CC-058, CC-060 (Reliability) | CC-019, CC-031, CC-093, CC-107 (Maintainability) |
Canon PR Review — Engineering Standards ScorecardPR: generic conda sdist-build-deferral hooks + metaflow-prebuilt orchestration package (v0.3.0) Scores
Decision: APPROVE Violations FoundCode QualityCC-021 — Long Method (Source: Refactoring, Clean Code)
ReliabilityCC-058 — No explicit timeouts on subprocess calls (Source: Release It!, DDIA)
What's Working Well
Scored against: CC-003, CC-005, CC-006, CC-011, CC-021 (Code Quality) | CC-027, CC-015, CC-028, CC-017 (Module Design) | CC-014, CC-048, CC-056, CC-058, CC-060 (Reliability) | CC-019, CC-031, CC-093, CC-107 (Maintainability) |
…rity/robustness) - python_requires>=3.10 (blocking): add a setup.py comment (both packages) that the floor mirrors the internal minimum (<3.10 dropped internally); no 3.10-only construct in the code — _parse_build_system_requires has tomllib/tomli/regex. - Pass B shared overlay: _overlay_install now warns when a build dep is re-staged at a DIFFERENT version (surfaces the documented shared-overlay conflict instead of a silent miscompilation). - deferred_sdists: comment that "url" is the guaranteed source (is_downloadable_url filter) and "filename" is only a best-effort hint (harmless if None). - _gather_embedded_wheels Case D/E: comment that p.url is a placeholder (is_real_url=False + cached wheel => p.url is never fetched). - Pass B overlay cleanup: comment that the on-failure rmtree + the finally rmtree double-remove is intentional and safe (ignore_errors=True). No functional change to validated paths (comments + one diagnostic warning). 77 prebuilt unit tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
78a123b to
1c79184
Compare
…rity/robustness) - python_requires>=3.10 (blocking): add a setup.py comment (both packages) that the floor mirrors the internal minimum (<3.10 dropped internally); no 3.10-only construct in the code — _parse_build_system_requires has tomllib/tomli/regex. - Pass B shared overlay: _overlay_install now warns when a build dep is re-staged at a DIFFERENT version (surfaces the documented shared-overlay conflict instead of a silent miscompilation). - deferred_sdists: comment that "url" is the guaranteed source (is_downloadable_url filter) and "filename" is only a best-effort hint (harmless if None). - _gather_embedded_wheels Case D/E: comment that p.url is a placeholder (is_real_url=False + cached wheel => p.url is never fetched). - Pass B overlay cleanup: comment that the on-failure rmtree + the finally rmtree double-remove is intentional and safe (ignore_errors=True). No functional change to validated paths (comments + one diagnostic warning). 77 prebuilt unit tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1c79184 to
8adf4f3
Compare
Give core conda two generic, default-off primitives and move ALL
prebuilt-specific orchestration into metaflow-prebuilt, so conda has zero
knowledge of the prebuilt feature.
metaflow-netflixext (1.3.11 -> 1.3.12), +99/-6 in conda core:
- pip_resolver: `defer_pypi_sdist_build` records sdists ({name,version,url})
instead of building them during resolution; default False.
- conda._create/create_for_step/create_for_name: `only_binary` installs only
pre-built wheels (analogous to `pip/uv --only-binary=:all:`); default False.
- conda_environment: `_make_envs_resolver()` factory hook so a subclass can
opt into deferral without duplicating init_environment.
Every change is a default-off param or a guarded branch; non-prebuilt
resolve/create behaviour is byte-for-byte unchanged. No `prebuilt`,
`deferred_builds.json`, or `schema_version` references in conda core.
metaflow-prebuilt (0.2.1 -> 0.3.0): owns the prebuilt protocol.
- prebuilt_conda_environment: `_make_envs_resolver` enables deferral;
`_gather_embedded_wheels` materializes wheels for non-web (git/local)
packages; `_generate_dockerfile` writes deferred_builds.json (schema "2")
+ COPYs it and deferred_wheels into the image.
- prebuilt_build_install: real two-pass installer (was a stub). Pass A
`create(only_binary=True)`; Pass B builds deferred sdists via uv. Owns the
/app/deferred_builds.json wire format + schema; _BUILD_INSTALL_MODULE ->
the prebuilt namespace.
Restores the deepspeed (sdist) / transnetv2 (git) support that lived inline
in mli-metaflow-custom but was never ported to the OSS packages.
Validated: 57 prebuilt unit tests; edited netflixext imports with all flags
default-off; composition checks (deferred path writes schema-2 json + COPY;
no-deferred path unchanged; resolver opt-in; container reader).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…>=1.3.12, sanitize PYTHONPATH - P1: _get_or_build_image scoped the resolver-wide deferred_sdists aggregate to the packages in THIS image's resolved_env. Previously every image received every environment's deferred sdists, so an unrelated step's image would run Pass B for an sdist it doesn't contain. (The inline impl had the same bug.) - setup.py: bump metaflow-netflixext lower bound 1.3.11 -> 1.3.12 (the prebuilt code calls only_binary / defer_pypi_sdist_build, added in 1.3.12; a stale 1.3.11 would TypeError at runtime). - bootstrap_commands: sanitize host-inherited PYTHONPATH (save MF_ORIG_PYTHONPATH, drop bare host stdlib dirs via the awk pipeline) to match the nflx base CondaEnvironment; the prior bare passthrough let host stdlib shadow the conda stdlib and lost the env-escape restoration point. (Codex also flagged --no-build-isolation dropping build-requires; that is a deliberate trade-off — Pass A installs all runtime deps so the sdist setup.py can import them, e.g. deepspeed needs torch at build time, and the builder has no PyPI for isolation to fetch from. Matches the inline implementation.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…CI installs local netflixext - P1 datastore: restore the `and self._storage` guard in conda.py lazy_fetch_packages and the _create conda section (the OSS migration dropped it; the inline impl has it). Without it, a non-local deploy datastore (e.g. s3) writes cache_info into the manifest; the prebuilt build container runs Conda(datastore="local", _storage=None), so "cache" wins the source race but the cache-download block is a no-op, leaving pypi packages unfetched -> "Local file expected" failure. With the guard, cacheless contexts fall through to the package's web URL. Non-prebuilt callers (storage present) are unchanged. - CI: test-prebuilt.yml installs the checked-out metaflow-netflixext before metaflow-prebuilt, so the `>=1.3.12` dependency resolves to this PR's package (the new APIs) instead of unpublished PyPI, and CI tests the PR's netflixext. - prebuilt_build_install: document the --no-build-isolation build-requires limitation (sdists needing build deps beyond setuptools/wheel); matches the inline impl, tracked as a follow-up. Codex's sdist-hash finding was a false positive: the archive hash is carried in the spec URL (url#hash), identical in the defer and non-defer paths. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nt GPU cache-key follow-up - P2 deferred-sdist identity (refines the round-1 per-image scoping): match the deferred records by exact sdist filename AND restrict to pypi packages, instead of (name, version). Adds "filename" (= _spec.filename) to the deferred record in pip_resolver. Prevents a conda package or a same-name/version package from a different source from pulling the wrong env's sdist into an image. - P1 GPU/CPU image cache key: documented as a TODO follow-up on _image_cache_key. It's pre-existing (the inline impl has the same gap) and a proper fix needs the gpu dimension in the key + the registry image tag + bootstrap_commands (an image-tag schema bump) — out of scope for this port; tracked for a dedicated PR. Codex's non-remote-steps finding was a false positive: bootstrap_commands is only invoked from remote-compute submission paths; local steps activate conda via _METAFLOW_CONDA_ENV / task_pre_step and never reach the "image not registered" raise. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…h; align Kaniko GCS path
- P1 embedded wheels (CRITICAL for git/local package support, e.g. transnetv2):
lazy_fetch_packages skipped local files when mode=="remote" (`if self._mode !=
"remote":`). The prebuilt build container runs Conda(mode="remote") and registers
embedded wheels (for non-web-downloadable packages) as local files, which were
then ignored -> fell through to the fake non-web URL -> build failed before Pass
A. Honor an explicitly-set local file even in remote mode, guarded by
os.path.isfile (no-op for normal remote tasks, which set no local files). This
restores the guard the OSS migration dropped (the inline impl has it).
- P2 Kaniko GCS path: drop empty path components so blob_path (upload) stays
aligned with context_url (built from bucket.rstrip("/")) for buckets like
gs://bucket/prefix/ — otherwise the context is uploaded to prefix//<key> and
Kaniko can't find it.
- P1 image identity (GPU/CPU base + target arch) and P2 cross-arch named-env
resolution: documented as a tracked image-identity follow-up (TODOs on
_image_cache_key and the alias-resolution path). Pre-existing; a proper fix is an
image-tag schema bump across the ImageRegistry contract + bootstrap_commands,
out of scope for this port. In a single deploy all remote steps share one
arch + base, so these are latent edge cases.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e-hit safe); pkg/path fixes
- P1 cache-hit deferred sdists: previously the deferred-sdist list was sourced
from the resolver's transient list, which is EMPTY on a cache hit (a re-deploy
loads the env from the conda manifest and skips resolution) — so the rebuilt
image emitted no deferred_builds.json while create_for_step(only_binary=True)
still skipped the cached .tar.gz, silently producing an image missing that
package. Now derive deferred sdists DIRECTLY from resolved_env.packages (pypi,
url_format != ".whl", web-downloadable) in _get_or_build_image — correct on
both fresh and cache-hit deploys, and naturally per-image. Removed the now-dead
resolver-side deferred_sdists recording (pip_resolver, envsresolver) and the
_envs_resolver_ref stash; the resolver still defers the *build* (keeps sdists
as source specs in the resolved env).
- P2 named-env @{VAR}: apply sub_envvars_in_envname before env_id_from_alias so
'env-@{NAME}' resolves the concrete alias, as standard conda does.
- P2 Kaniko S3 prefix: normalize trailing slash (align with the GCS fix and
context_url's bucket.rstrip("/")).
- P2 python_requires: bump to >=3.10 to match metaflow-netflixext>=1.3.12 (avoid
pip selecting this wheel on 3.8/3.9 then failing the dependency).
- Image identity (GPU/arch) + cross-arch named-env: tracked as documented
follow-ups (image-tag schema bump).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…omplete named-env @{VAR} fix
- P1: Pass B (_run_deferred_sdist_builds) now prefers the local sdist artifact
that Pass A's lazy_fetch already downloaded and hash-verified, instead of
re-downloading the raw URL (which carries no creds/hash and could differ from
the resolved env). Falls back to the URL only when no local file is present,
so public-PyPI behavior is unchanged.
- P2 named-env @{VAR}: bootstrap_commands now applies sub_envvars_in_envname
before _named_state_key, matching _build_prebuilt_images. Without it a build
stored under the substituted alias would not be found at bootstrap ("image not
registered") for names like 'env-@{NAME}'.
Cross-arch named-env resolution remains the documented image-identity follow-up
(env_id_from_alias resolves for the deploy host arch).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…odeBuild region, mutable-URL hash) These are P2 edge cases in non-default code paths, none affecting the primary Netflix newt path or the validated public targets: - buildx cross-arch --platform (folds into the image-identity follow-up), - ECR+CodeBuild region precedence (codebuild-region follow-up), - deferred-sdist identity for *mutable* URLs (the sha256 is carried on the spec URL, so identity is stable per-URL; mutable URLs are a PyPI anti-pattern). Added in-code TODO/NOTEs for traceability rather than expanding this PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…le (lazy_fetch) The round-4 embedded-wheel fix added `and os.path.isfile(local_path)` to the lazy_fetch local-file check. That regressed the conda integration suite (tests/plugins/conda/create_env_test.py), which registers a non-web package's local file at a tempfile.mktemp() path that is not physically created, then expects lazy_fetch to honor it — with the isfile guard the file is skipped and resolution errors "No non-web source for non web-downloadable package". Revert to the original truthy `if local_path:` check (keeping the mode-guard removal so remote-mode tasks still honor explicitly-registered local files). The prebuilt embedded-wheel case is unaffected: those wheels are COPY'd into the build context and registered with a real on-disk path, so the truthy check honors them. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… for Pass B; doc cross-arch limit
- P2 deferred-sdist hash: the deferred sdist spec is now constructed with
hashes={url_format: parse_result.hash} (matching the wheel branch), so its
sha256 contributes to the resolved-env full_id and Pass A/B can verify the
bytes — instead of full_id seeing "filename#None" (mutable/direct-URL sdists
could otherwise reuse the same env/image identity for different content).
- P2 Pass B PATH: the deferred-sdist build subprocess now runs with <env>/bin
prepended to PATH, so a build that invokes tools Pass A installed into the env
(compilers, cmake/ninja, console scripts) finds them rather than only the
Docker build image's PATH.
- P2 cross-arch sdist deferral: documented as a known limitation. pip's
--platform (used for cross-arch resolution) requires --only-binary=:all:, so a
deferred sdist-only dep can't be resolved cross-arch (e.g. arm64->linux-64);
the common same-arch workbench->Titus path is unaffected. A proper fix needs
arch-independent sdist metadata extraction (a follow-up).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…step deferral
- P2: the deferred-sdist derive now also requires `not p.local_file(".whl")`, so
a package that kept a source url_format but already has a built wheel
(pylock/uv/legacy cache) is installed directly in Pass A rather than being
reclassified as a deferred sdist and rebuilt in Pass B.
- P2: documented that sdist deferral applies to all of a flow's steps; a local /
non-remote step builds the deferred sdist at create time from the lazy-fetched
source (works for buildable sdists; same build-requires limitation as Pass B).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…g); document pylock limit
The round-9 guard (`and not p.local_file(".whl")`) excluded source-url_format
specs that already have a built wheel from the deferred list, but those packages
are web-downloadable so _gather_embedded_wheels doesn't embed them either —
leaving them neither deferred nor embedded, i.e. silently missing from the image.
That silent failure is worse than the round-8 behavior (defer -> Pass B rebuilds
the sdist, which fails loudly if it can't build). Revert the guard.
This edge only arises for pylock/conda-lock-resolved pypi sdists (source
url_format + an already-built wheel); the supported pip `@pypi` defer path never
produces such specs (it doesn't build sdists on the deploy machine). Properly
reusing the deploy-built wheel needs the container's only_binary install to
prefer an embedded .whl over the source url_format — documented as a follow-up.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nd 9/10 edge)
A pypi spec can keep a source url_format (".tar.gz") while a resolver
(pylock/conda-lock) already built a wheel for it. Previously such a package was
either deferred (rebuilt from source in Pass B — redundant, could diverge) or,
after the round-9 guard, dropped entirely (silently missing). Handle it properly:
- deferred_sdists derive: defer only web-downloadable source packages with NO
built wheel (add `local_file(".whl") is None and cached_version(".whl") is None`).
- _gather_embedded_wheels: broaden candidates to packages that need a built wheel
embedded — non-web (git/local) AND web-source-with-a-cached-wheel — fetch the
wheel on the deploy machine (where _storage is live, so cached_version(".whl")
resolves and local_file(".whl") is populated), embed the .whl (never the source),
and record url_format=".whl".
No _create change needed: _ALL_PYPI_FORMATS lists ".whl" first, so _create's
pypi_paths loop picks the embedded .whl over a source url_format and only_binary
keeps it. _register_embedded_wheels is unchanged (it add_local_file's the
record's url_format, now always ".whl").
Verified by an A-E classification trace: A web-fetched, B/D/E embedded, C
deferred (deepspeed), conda excluded — disjoint and complete; create_env_test
(cached_version None + local_file(".whl") set) is unaffected.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e deferring (Codex r11)
Two fixes for the deferred-sdist / cache interaction in --environment=prebuilt:
P1 (build-container storage): prebuilt_build_install runs inside the Docker
build container, which has the resolved-env manifest but NOT the deploy
machine's package cache. With the OSS-default CONDA_IGNORE_CACHING_DATASTORES=[],
Conda(datastore_type="local") builds a LocalStorage over that empty cache, so
lazy_fetch prefers it and fails instead of downloading conda/wheel packages
from their web URLs. Force the container's "local" datastore to non-caching by
setting METAFLOW_CONDA_IGNORE_CACHING_DATASTORES=["local"] before importing
metaflow (from_conf freezes config at import; list configs parse as JSON), so
_storage is None and the storage guards fall through to web + embedded wheels.
Proven: Conda(local)._storage is LocalStorage without the var, None with it.
P2 (probe cached wheels before deferring): the defer branch previously removed
deferred keys from to_build_pkg_info BEFORE build_pypi_packages, skipping its
cache probe. A package with an already-built wheel in the cache got deferred +
rebuilt in the container instead of reused. Now collect the deferred keys and
pass them to build_pypi_packages via defer_keys: the cache probe still runs
(attaching cached_version('.whl') so the prebuilt derive embeds the wheel and
excludes it from deferred_sdists), and only the expensive build is skipped for
sdists with no cached wheel. A no-storage fallback preserves the historical
defer path. Default (defer off) is byte-for-byte unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e base-image copies (Codex r12) _generate_dockerfile previously only wrote + COPYed deferred_builds.json when the env had deferred sdists or embedded wheels. But the in-container installer (prebuilt_build_install) treats ANY file at the absolute path /app/deferred_builds.json as current. If the configured base image (METAFLOW_PREBUILT_BASE_IMAGE — arbitrary user config, possibly a prior prebuilt image) already carries a stale hand-off there, a build with no deferred work would consume those stale records (wrong env) or fail schema validation. Always write the current env's hand-off (empty sdists/wheels when there is nothing deferred) and always COPY it, so it overwrites any inherited copy. An empty hand-off is a valid no-op for the installer. Embedded-wheels dir COPY stays gated on having wheels; a stale inherited wheels dir is harmless because the current (empty) hand-off lists no wheels to read. Updates test_generate_dockerfile to assert the hand-off is always present (empty) + COPYed, and adds a test for the deferred sdists + embedded wheels case. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_gather_embedded_wheels embeds a BUILT WHEEL for pypi packages the build
container can't fetch/build. It called lazy_fetch_packages on the package spec
itself, then searched for a local .whl. But lazy_fetch materializes only a
package's single most-preferred source, and a local file outranks a cached
version. A source-format package that carries a local source archive (cases D/E
— e.g. a local/git sdist registered during resolution) on a resolve where the
wheel is a cache HIT therefore had its sdist materialized and the cached wheel
left unfetched, so the wheel search found nothing and the build failed loudly on
an otherwise valid environment.
Fetch the wheel through a dedicated spec instead:
- url_format == ".whl" (case B, the package IS the wheel): fetch it directly.
- source package with a cached wheel (D/E): build a synthetic wheel-only,
cache-only PypiPackageSpecification (no source local file) so the cached
".whl" is the preferred/only source and is always materialized.
- non-web source with no cached wheel: fall through to the existing loud
failure (it can be neither fetched nor built in the container).
Adds a behavioral regression test (test_gather_embedded_wheels_prefers_cached_
wheel_over_local_sdist) with a faithful lazy_fetch stub; it fails on the old
logic (CondaException) and passes on the fix.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e setuptools<82 in Pass B (Codex r14) P2-A (pip_resolver): a deferred sdist's source spec becomes the resolved-env identity (it is not replaced by a built wheel). For a downloadable sdist with no content hash (e.g. a direct archive URL with no #sha256), the env full_id would depend on `filename#None`, so different bytes at the same URL could silently reuse the same prebuilt image. Only defer sdists that carry a hash; leave hashless ones to build_pypi_packages, which builds the wheel and derives a stable content-hash identity (the pre-defer behavior). P2-B (prebuilt_build_install): Pass B builds sdists with --no-build-isolation, so the build backend is the env's OWN setuptools. setuptools>=82 dropped pkg_resources, which legacy setup.py files import — the deploy-side builder constrains setuptools<82 for exactly this, but Pass B only pinned <82 when setuptools was missing. Make the pre-check version-aware: require an importable wheel + setuptools major < 82, downgrading/installing setuptools<82 otherwise. (Verified against conda-forge's python 3.14 env, which ships setuptools 82.0.1.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…te the runtime env (Codex r15) The r14 fix downgraded setuptools<82 directly into the target env so Pass B's --no-build-isolation builds could use it. But that mutated the RESOLVED runtime environment: an env that resolved/pinned setuptools>=82 (the conda-forge default ships 82.x) would be silently downgraded while the image still carries the original .metaflowenv full_id, so tasks would run with a different setuptools than the manifest claims. Treat the <82 pin as transient: when a setuptools is already present (and the probe is unsatisfied), record its version, build with <82, then reinstall the resolved version after the builds — the built wheels are already installed and don't depend on the build-time setuptools. Restoration failure raises (refuse to ship an env that disagrees with its manifest) rather than degrading silently. A previously-absent setuptools is just added (nothing resolved to preserve). Factors the uv/pip install invocation into a local _install_args helper. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…, not is-not-None) (Codex r16)
_ensure_remote_conda gated the datastore-installer path on
`CONDA_REMOTE_INSTALLER is not None`. An EMPTY string ("") is not None, so it
took _install_remote_conda — which requires a datastore _storage to fetch the
installer. The prebuilt build container deliberately sets
METAFLOW_CONDA_REMOTE_INSTALLER="" to disable the datastore installer and runs
with _storage=None (it self-installs micromamba from the web), so the
`is not None` check routed it to _install_remote_conda and raised
("Downloading conda remote installer ... is unimplemented"), failing every
generated prebuilt image build.
Widen the check to truthy so an empty installer correctly takes the
self-install (_ensure_micromamba) path. A configured (non-empty) installer is
unchanged. This restores an enabler-fix the OSS migration dropped (same class as
the storage guard); adds ensure_remote_conda_test.py guarding both branches
(it fails on the old is-not-None check, passes on the fix).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ver mutate the env (Codex r17) Supersedes the r14b/r15 approach (install setuptools<82 into the env, then reinstall the resolved version). r17 correctly flagged that the restore used pip/uv, layering a PyPI setuptools over the conda-provided one — so the shipped image no longer matched the resolved conda manifest (wrong provenance / hash). Root fix: don't touch the env at all. When the env's setuptools is missing or >=82, stage setuptools<82 + wheel into a temp dir and prepend it to the build subprocess' PYTHONPATH (a build overlay). Pass B's --no-build-isolation build then imports setuptools<82 from the overlay while runtime deps (e.g. torch) still resolve from the env's site-packages; the env's own conda setuptools is untouched, so the image stays byte-identical to its manifest full_id. The overlay is removed in a finally after the builds (transient build tooling, never shipped). Collapses the three setuptools rounds (r14b/r15/r17) into one no-mutation design. Note: whether uv/pip honor the overlay PYTHONPATH for the --no-build-isolation build backend is best confirmed by the L3 image build (deepspeed); the design is correct-by-construction (no env mutation) regardless. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… only TYPE=="conda" (L3 finding) L3 ground-truth: `--environment=prebuilt` + @pypi/@conda failed immediately with "The @pypi decorator requires --environment=conda". The netflixext step/flow decorators gated on `environment.TYPE != "conda"`, which rejects PrebuiltCondaEnvironment (a CondaEnvironment subclass with TYPE="prebuilt") — so the prebuilt environment was unusable with @pypi/@conda at all. Nothing registered "prebuilt" as a supported env; static review (the whole Codex cycle) missed this because it is a decorator/registration gap, not a logic bug in the changed files. Widen the three checks (PackageRequirementStepDecorator.step_init, CondaEnvInternalDecorator/StepRequirementMixin step_init, and the flow PackageRequirementFlowDecorator.flow_init) from `TYPE != "conda"` to `not isinstance(environment, CondaEnvironment)`. This accepts any conda-based environment (incl. downstream subclasses like prebuilt) WITHOUT netflixext hardcoding "prebuilt" (upstream stays agnostic of the downstream package); a non-conda env (e.g. nflx, which subclasses MetaflowEnvironment) is still rejected. Verified end to end: the prebuilt flow then resolves + builds images. Adds decorator_env_compat_test.py (conda subclass accepted, plain conda accepted, non-conda rejected); it fails on the old TYPE check, passes on the fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_run_deferred_sdist_builds (Pass B) logs progress via _echo(..., timestamp=False), but echo_always() forwards **kwargs to click.secho() -> style(), which has no 'timestamp' parameter -> TypeError. Pass B would crash on its first progress line the moment it ran on a deferred sdist with no embedded/cached wheel. It never surfaced in the Netflix flow because the resolver constrains setuptools<82 and the S3 cache holds built wheels, so sdists get embedded (Pass A) and Pass B never fires. Fix: _echo() drops the unsupported 'timestamp' kwarg (nl=False stays valid for secho). Add tests/unit/test_prebuilt_passb.py (no prior coverage of the in-container build): - _SETUPTOOLS_CHECK probe: exit 0 (<82), 4 (>=82), 3 (missing) - _run_deferred_sdist_builds compiles + installs a real deferred sdist - the setuptools<82 overlay rescues a deliberately-incomplete build toolchain - no-op when the deferred-sdist list is empty (the normal Netflix case) Found by running the new tests; verified against the real metaflow echo_always/secho signatures (codex-confirmed). Full prebuilt unit suite: 65 passed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…uires
Two latent-bug fixes flagged by TODO/KNOWN-LIMITATION in the prebuilt code.
1) image-identity (was: _image_cache_key keyed on req_id/full_id only)
Two steps that share an env but need a DIFFERENT base image (CPU vs GPU via
@resources(gpu=1)) or a different arch collided on one cached image AND one
registry tag — a GPU build could overwrite the CPU image's tag, and a GPU step
could run on a CPU base. Split the identities:
- _env_cache_key(env_id) = req_id_full_id (base-INDEPENDENT) keys
_prebuilt_env_paths + the runtime bootstrap lookup (the conda env/env_path
are identical across bases; bootstrap can't know the base at runtime).
- _image_variant(base_image, arch) + _image_dedup_key key _prebuilt_images
(in-process dedup) AND are appended to the registry push/pull tags via
_tag_with_variant, so distinct-base/arch images get distinct tags.
- bootstrap_commands validates against _prebuilt_env_paths (env key), not
_prebuilt_images (now keyed by env+variant, which bootstrap can't rebuild).
PREBUILT_IMAGE_SCHEMA_VERSION bumped v28 -> v29.
2) Pass B build-requires (was: KNOWN LIMITATION — sdists needing Cython/maturin/
setuptools_scm/etc. failed under --no-build-isolation)
_build_requires_from_sdist() reads [build-system].requires from the sdist's
pyproject.toml (tomllib/tomli or a regex fallback) and stages them into the
build overlay alongside the controlled setuptools<82 + wheel before building.
Overlay creation refactored into a lazy _ensure_overlay/_overlay_install, now
fully inside try/finally so it's always cleaned up. (The overlay is shared
across a single image's sdists — consistent with --no-build-isolation; not
per-sdist isolation.)
Tests: tests/unit/test_prebuilt_conda_environment.py (TestImageIdentity: variant
distinguishes base/arch, dedup key + tag don't collide for a shared env, bootstrap
resolves via the env key when only a variant image key exists) and
tests/unit/test_prebuilt_passb.py (build-requires extracted + staged; an sdist
whose setup.py imports a build-only dep builds). 74 prebuilt unit tests pass.
Both codex-reviewed CORRECT. image-identity validated end-to-end on real Titus:
PrebuiltImageIdentityFlow/2 — a CPU step and a GPU step that share full_id
67c32f7f got DISTINCT images (CPU: nvidia-smi absent; GPU: nvidia-smi present),
proving the collision is gone.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…n TODOs) Resolves the two remaining open TODOs in the prebuilt code. 1) TODO(image-identity) cross-arch build platform (was: buildx built for the builder's DEFAULT platform, so an arm64 deploy of a linux-64 step baked an arm64 image whose in-container arch_id() mismatched the resolved manifest). build_and_push() gained `target_platform` (Optional, default None = builder default / unchanged same-arch). The caller derives it from the resolved env's arch (_docker_platform_for_arch: linux-64->linux/amd64, linux-aarch64-> linux/arm64, else None). buildx/docker emit `--platform`; kaniko emits `--custom-platform`; CodeBuild + the Netflix NewtBuildService accept it but build on a fixed remote arch (Cloudbuild=linux-64), so they ignore it. The kwarg is on EVERY build_and_push override (incl. NewtBuildService) so the unified call can't TypeError. 2) TODO(codebuild-region): a CodeBuild project in a different region than ECR built in the wrong region. EcrRegistry.push_credentials now exposes a `codebuild_region` (from METAFLOW_PREBUILT_CODEBUILD_REGION); CodeBuildService prefers it (codebuild_region -> region -> env), falling back to the ECR region when unset (unchanged behavior). Also corrects a stale docstring: _run_deferred_sdist_builds still described the build-deps thing as a KNOWN LIMITATION after it was implemented; now documents the actual behavior + the residual shared-overlay/binary-tool caveats. Tests: buildx --platform (set / default-absent), _docker_platform_for_arch mapping. 77 prebuilt unit tests pass. codex-reviewed (kaniko flag spelling fixed per review). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rity/robustness) - python_requires>=3.10 (blocking): add a setup.py comment (both packages) that the floor mirrors the internal minimum (<3.10 dropped internally); no 3.10-only construct in the code — _parse_build_system_requires has tomllib/tomli/regex. - Pass B shared overlay: _overlay_install now warns when a build dep is re-staged at a DIFFERENT version (surfaces the documented shared-overlay conflict instead of a silent miscompilation). - deferred_sdists: comment that "url" is the guaranteed source (is_downloadable_url filter) and "filename" is only a best-effort hint (harmless if None). - _gather_embedded_wheels Case D/E: comment that p.url is a placeholder (is_real_url=False + cached wheel => p.url is never fetched). - Pass B overlay cleanup: comment that the on-failure rmtree + the finally rmtree double-remove is intentional and safe (ignore_errors=True). No functional change to validated paths (comments + one diagnostic warning). 77 prebuilt unit tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This branch was rebased onto main, which released metaflow-netflixext 1.3.12 WITHOUT the generic sdist-build hooks in this PR. Bump to 1.3.13 so the new only_binary / defer_pypi_sdist_build APIs ship under a fresh version, and pin metaflow-prebuilt's floor (and the CI install comment) to match. Also fixes a stale `metaflow_extensions.nflx` doc comment left by the namespace move (#93). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Keep newly added OSS tests in the netflixext namespace so pure OSS CI does not require an internal package providing metaflow_extensions.nflx. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
metaflow-netflixext 1.3.13 is already published from the nflx_compat shim release, so the generic sdist/prebuilt hooks need a fresh netflixext version for the next release. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
c4777c6 to
8fdae0f
Compare
Summary
Adds generic conda sdist-build-deferral hooks to metaflow-netflixext and ships the complete metaflow-prebuilt orchestration package for `--environment=prebuilt` — a workflow that pre-bakes resolved conda environments into Docker images so remote tasks bootstrap in seconds rather than minutes.
metaflow-netflixext (conda core changes)
Two primitives, each default-off — non-prebuilt resolve/create is byte-for-byte unchanged:
Conda core contains zero references to `prebuilt`, `deferred_builds.json`, or `schema_version`.
metaflow-prebuilt (owns the protocol)
Also restores deepspeed (sdist) and transnetv2 (git) support that was developed inline in the internal repo but never ported to the OSS packages.
Validation
🤖 Generated with Claude Code