@speediedan speediedan commented Dec 3, 2025

Overview

This PR enhances the FTS build system by migrating from pip to uv for dependency management and standardizing CI workflows with reusable composite actions. The changes also include fixes for CI issues discovered during testing.

Key Changes

1. UV Migration

Core Changes:

  • Replaced pip install with uv pip install across all documentation and workflows
  • Removed USE_CI_COMMIT_PIN environment variable in favor of UV_OVERRIDE mechanism
  • Lightning commit pinning is now handled via requirements/ci/overrides.txt and the UV_OVERRIDE environment variable
  • All workflows now use astral-sh/setup-uv@v7 for consistent uv installation

Files Updated:

  • README.md - Updated source installation examples
  • tests/README.md - Updated testing instructions and coverage collection examples
  • Makefile - Changed pip → uv pip for test/docs targets
  • .github/CONTRIBUTING.md - Updated development setup instructions with nightly torch option
  • docs/source/index.rst - Updated installation examples
  • docs/source/install/dynamic_versioning.rst - Updated CI commit pinning docs
  • .github/workflows/ci_schema.yml - Added uv setup, use uv pip install
  • src/finetuning_scheduler/dynamic_versioning/utils.py - Updated error messages

2. CI Workflow Modernization

Reusable Composite Action:

  • Created .github/actions/install-ci-dependencies/action.yml - Reusable action for dependency installation
  • Supports both latest and oldest dependency matrices
  • Generates UV_OVERRIDE from requirements/ci/overrides.txt for Lightning commit pinning
  • Used by ci_test-full.yml and code-checks.yml
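
As a rough illustration of the override mechanism (not the composite action's actual implementation), the sketch below builds an environment in which `UV_OVERRIDE` points at the overrides file. The helper name `env_with_override` and the check for at least one non-comment line are assumptions made for this example.

```python
import os
from pathlib import Path


def env_with_override(overrides_path, base_env=None):
    """Return a copy of the environment with UV_OVERRIDE set to overrides_path.

    Hypothetical helper: refuses files that contain only comments or blank
    lines, since such a file would pin nothing.
    """
    path = Path(overrides_path).resolve()
    pins = [
        line.strip()
        for line in path.read_text().splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    if not pins:
        raise ValueError(f"{path} contains no override requirements")
    env = dict(os.environ if base_env is None else base_env)
    env["UV_OVERRIDE"] = str(path)
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` when invoking `uv pip install`, so the pin applies only to that one command.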

Workflow Updates:

  • .github/workflows/ci_test-full.yml - Simplified using composite action
  • .github/workflows/code-checks.yml - Added parallel checking (ruff, pyright, pre-commit)
  • .azure-pipelines/gpu-tests.yml - Updated for Azure GPU testing
  • .github/workflows/documentation-links.yml - NEW: ReadTheDocs PR preview action that automatically adds documentation preview links to PRs that touch docs, README, or source files (docstrings)

3. Requirements Management

Consolidation to pyproject.toml:
All optional dependencies are now defined in pyproject.toml with extras: cli, extra, examples, ipynb, all.

New Files:

  • requirements/ci/requirements.txt - Locked CI dependencies (latest matrix; torch is excluded when nightly mode is configured)
  • requirements/ci/requirements-oldest.txt - Locked CI dependencies (oldest matrix, always stable torch)
  • requirements/ci/overrides.txt - Lightning commit pin specification
  • requirements/ci/torch-nightly.txt - PyTorch nightly version configuration
  • requirements/ci/torch_override.txt - Generated file with nightly torch version and installation instructions
  • requirements/utils/lock_ci_requirements.sh - Script to regenerate locked requirements

PyTorch Nightly Handling (Two-Step Approach):
The lock script now handles torch nightly with a secure two-step installation approach:

When torch-nightly.txt is configured:

  • Lockfile Generation:

    • Creates temporary override to pin torch to nightly version for dependency resolution
    • Uses --prerelease=if-necessary-or-explicit for prereleases
    • Uses --index-strategy unsafe-best-match for lockfile generation only (security-isolated to maintainer machines)
    • Excludes torch from output via --no-emit-package torch
    • Generates torch_override.txt with version and installation instructions
  • User Installation (Secure Two-Step):

    1. Install PyTorch nightly first: uv pip install --prerelease=if-necessary-or-explicit torch==<version> --index-url https://download.pytorch.org/whl/nightly/cu128
    2. Install FTS with Lightning commit pin: UV_OVERRIDE=requirements/ci/overrides.txt uv pip install -e ".[all]"
  • CI Installation:

    • Docker images pre-install nightly torch
    • GitHub Actions uses CPU nightly with same two-step approach
    • Azure Pipelines uses Docker with pre-installed torch
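
The two user-installation steps above can be sketched as a pair of command lines built in Python. This is only an illustration: the helper names `two_step_commands` and `render` are hypothetical, and the default index URL and overrides path are simply the ones quoted in the steps above.

```python
import shlex

# Index URL quoted in step 1 above; other CUDA/CPU variants would differ.
NIGHTLY_INDEX = "https://download.pytorch.org/whl/nightly/cu128"


def two_step_commands(torch_version,
                      overrides="requirements/ci/overrides.txt",
                      index_url=NIGHTLY_INDEX):
    """Return [(argv, extra_env), ...] for the two-step nightly install."""
    step1 = [
        "uv", "pip", "install",
        "--prerelease=if-necessary-or-explicit",
        f"torch=={torch_version}",
        "--index-url", index_url,
    ]
    # Step 2 deliberately carries no extra index: only torch comes from the
    # nightly index, everything else resolves from the default index.
    step2 = ["uv", "pip", "install", "-e", ".[all]"]
    return [(step1, {}), (step2, {"UV_OVERRIDE": overrides})]


def render(argv, extra_env):
    """Format a command and its extra environment as a shell line."""
    prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in extra_env.items())
    return f"{prefix} {shlex.join(argv)}".strip()


if __name__ == "__main__":
    for argv, extra_env in two_step_commands("2.10.0.dev20251124+cu128"):
        print(render(argv, extra_env))
```

Keeping the steps separate means the nightly index is consulted only for torch, which is the point of the two-step approach.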

Without torch-nightly.txt:

  • Uses stable torch from PyPI via --torch-backend=cpu or --torch-backend=auto
  • Removes stale torch_override.txt if present
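
The branch between the nightly and stable lock paths might look roughly like the following. This is a sketch under assumptions, not the contents of lock_ci_requirements.sh: the function name `compile_args`, the use of pyproject.toml as the compile input, and the default output path are hypothetical, and the temporary torch override used for nightly resolution is omitted for brevity. Only the flags themselves are taken from the description above.

```python
def compile_args(nightly_version,
                 output="requirements/ci/requirements.txt",
                 torch_backend="cpu"):
    """Build an illustrative `uv pip compile` argument list.

    nightly_version: version string from torch-nightly.txt, or None/"" when
    nightly mode is disabled.
    """
    args = ["uv", "pip", "compile", "pyproject.toml", "--output-file", output]
    if nightly_version:
        args += [
            "--prerelease=if-necessary-or-explicit",
            # Only used while generating the lockfile, never at install time.
            "--index-strategy", "unsafe-best-match",
            # torch is installed separately, so keep it out of the lockfile.
            "--no-emit-package", "torch",
        ]
    else:
        args.append(f"--torch-backend={torch_backend}")
    return args
```

With nightly disabled, `compile_args(None)` yields the stable path, where torch stays in the lockfile and is resolved with the selected backend.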

Removed Files:

  • requirements/lightning_pin.txt - Superseded by requirements/ci/overrides.txt
  • requirements.txt (repo root) - No longer needed
  • requirements/base.txt - Dependencies now in pyproject.toml and dynamic_versioning/utils.py
  • requirements/examples.txt - Now in pyproject.toml [project.optional-dependencies]
  • requirements/extra.txt - Now in pyproject.toml [project.optional-dependencies]
  • requirements/cli.txt - Now in pyproject.toml [project.optional-dependencies]
  • requirements/ipynb.txt - Now in pyproject.toml [project.optional-dependencies]
  • requirements/devel.txt - Now in pyproject.toml [dependency-groups]
  • requirements/test.txt - Now in pyproject.toml [dependency-groups]
  • .actions/assistant.py - No longer needed (was used for requirements processing)

4. Build Script Enhancements

scripts/build_fts_env.sh:

  • Added --oldest flag for building with oldest supported dependencies (Python 3.10)
  • Added --venv-dir flag for explicit venv base directory
  • Added pyright check with graceful failure (replaces mypy)
  • Improved uv cache management and error handling

scripts/gen_fts_coverage.sh:

  • Added --oldest flag to pass through to build script
  • Added --no-special flag to skip standalone/experimental tests

New Scripts:

  • scripts/manage_standalone_processes.sh - Wrapper for isolated nohup execution with logging
  • scripts/manage_standalone_regex.cfg - Regex patterns for output filename generation

5. Docker Updates

dockers/base-cuda/Dockerfile:

  • Fixed uv installation (system-wide at /usr/local/bin instead of user-specific)
  • Added UV_HTTP_TIMEOUT=120 for large package downloads (NVIDIA packages)
  • Simplified path management
  • Fixed venv permission ordering: Moved chmod -R 777 /tmp/venvs/fts_dev from venv creation step to the final RUN step (after all Python verification commands). This prevents __pycache__ directories created by Python execution from having root ownership, which was causing "Permission denied" errors in Azure Pipelines with userns-remapped rootless Docker.

dockers/fts-az-base/Dockerfile:

  • Added uv to Azure base image

6. Bug Fixes

tensorboardX Compatibility:

  • Updated pyproject.toml to require tensorboardX>=2.6.1 for protobuf 4.x compatibility
  • Regenerated locked requirements with correct tensorboardX version

torch.utils.collect_env Compatibility with UV:

  • Fixed src/fts_examples/cli_experiment_utils.py to handle cases where collect_env returns empty pip packages when using uv (since uv doesn't install pip by default)
  • Added defensive parsing with split("==", 1) to handle edge cases in package version strings
  • Added pip>=21.0.0 as an explicit dependency for backward compatibility with torch collect_env
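
A minimal sketch of the defensive parsing described above, assuming the package listing arrives as newline-separated `name==version` text, or as `None` or an empty string when pip is unavailable. The function name `parse_pip_packages` is hypothetical and this is not the code in cli_experiment_utils.py.

```python
def parse_pip_packages(pip_packages):
    """Map package name to version from 'name==version' lines.

    Tolerates None or empty input (no pip available) and entries without
    a '==' separator, which are recorded with a version of None.
    """
    versions = {}
    for line in (pip_packages or "").splitlines():
        entry = line.strip()
        if not entry:
            continue
        # maxsplit=1 keeps anything after the first '==' in the version part.
        parts = entry.split("==", 1)
        name = parts[0].strip()
        version = parts[1].strip() if len(parts) == 2 else None
        versions[name] = version
    return versions
```

Splitting with `maxsplit=1` means an unusual entry can never produce more than two parts, so the unpacking logic does not raise on malformed lines.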

Dynamic Versioning:

  • Simplified setup.py dependency handling
  • Updated error messages in utils.py to reference uv pip

Code Cleanup:

  • Removed commented code from src/finetuning_scheduler/strategy_adapters/base.py
  • Simplified tests in tests/test_dynamic_versioning_utils.py

Sphinx/RTD Configuration:

  • Updated docs/source/conf.py to use hardcoded mock package list instead of reading from deleted requirements files
  • Updated .readthedocs.yml to use .[examples] extra instead of separate requirements files

setup_tools.py:

  • Changed _load_requirements() default file_name from "base.txt" to "requirements.txt"

7. Documentation

New:

  • .github/copilot-instructions.md - Comprehensive copilot instructions

Updated:

  • README.md - Two-step nightly installation instructions
  • tests/README.md - Two-step nightly installation, coverage examples
  • .github/CONTRIBUTING.md - Added PyTorch nightly installation section
  • docs/source/install/dynamic_versioning.rst - Two-step nightly installation

Migration Notes

For Users

  • Replace pip install commands with uv pip install
  • The USE_CI_COMMIT_PIN environment variable is no longer used
  • Use scripts/build_fts_env.sh for development environment setup
  • For PyTorch nightly: use two-step installation (install torch first, then FTS)

For CI/CD

  • Lightning commit pinning is now handled via the UV_OVERRIDE environment variable
  • Override content is auto-generated from requirements/ci/overrides.txt
  • Lock files should be regenerated when dependencies change using requirements/utils/lock_ci_requirements.sh
  • Nightly torch requires two-step installation (not combined with FTS install)

Validated Installation Scenarios

| Scenario | Torch Version | Status |
| --- | --- | --- |
| Two-step nightly (editable) | 2.10.0.dev20251124+cu128 | |
| Two-step nightly (locked requirements) | 2.10.0.dev20251124+cu128 | |
| Oldest requirements (stable torch) | 2.6.0+cu124 | |

Files Changed Summary

49 files changed, 3851 insertions(+), 645 deletions(-)

New Files (11)

  • .github/actions/install-ci-dependencies/action.yml
  • .github/copilot-instructions.md
  • .github/workflows/documentation-links.yml
  • requirements/ci/overrides.txt
  • requirements/ci/requirements-oldest.txt
  • requirements/ci/requirements.txt
  • requirements/ci/torch-nightly.txt
  • requirements/ci/torch_override.txt
  • requirements/utils/lock_ci_requirements.sh
  • scripts/manage_standalone_processes.sh
  • scripts/manage_standalone_regex.cfg

Removed Files (10)

  • .actions/assistant.py
  • requirements/lightning_pin.txt
  • requirements.txt (repo root)
  • requirements/base.txt
  • requirements/examples.txt
  • requirements/extra.txt
  • requirements/cli.txt
  • requirements/ipynb.txt
  • requirements/devel.txt
  • requirements/test.txt

Modified Files (28)

  • .azure-pipelines/gpu-tests.yml
  • .github/CONTRIBUTING.md
  • .github/workflows/ci_schema.yml
  • .github/workflows/ci_test-full.yml
  • .github/workflows/code-checks.yml
  • .readthedocs.yml
  • Makefile
  • MANIFEST.in
  • README.md
  • dockers/base-cuda/Dockerfile
  • dockers/release/Dockerfile
  • dockers/fts-az-base/Dockerfile
  • docs/source/conf.py
  • docs/source/index.rst
  • docs/source/install/dynamic_versioning.rst
  • pyproject.toml
  • scripts/build_fts_env.sh
  • scripts/gen_fts_coverage.sh
  • setup.py
  • src/finetuning_scheduler/dynamic_versioning/utils.py
  • src/finetuning_scheduler/setup_tools.py
  • src/finetuning_scheduler/strategy_adapters/base.py
  • src/fts_examples/cli_experiment_utils.py
  • src/fts_examples/stable/fts_superglue.py
  • src/fts_examples/stable/model_parallel/torchtitan_llama.py
  • tests/README.md
  • tests/helpers/expected_warns.py
  • tests/test_dynamic_versioning_utils.py

Migrate to uv pip interface for CI builds with static dependency override
support. This establishes a simpler, reproducible CI workflow.

Changes:
- Add GitHub Copilot instructions (.github/copilot-instructions.md) with
  comprehensive repository context, build commands, and development
  guidelines
- Add static CI overrides file (requirements/ci/overrides.txt) for
  Lightning commit pinning, replacing dynamic generation approach
- Add locked CI requirements (requirements/ci/requirements.txt) generated
  via uv pip compile for reproducible builds
- Add lock script (requirements/utils/lock_ci_requirements.sh) wrapper
  around uv pip compile
- Add CI composite action (.github/actions/install-ci-dependencies) for
  unified uv-based installation
- Add process management wrapper (scripts/manage_standalone_processes.sh)
  to prevent duplicate test execution during coverage collection

The override file is used with `uv pip install --override` to pin
Lightning to a specific git commit during CI builds.

Simplify CI dependency management:

- Filter torch directly in requirements.txt during lock generation when
  nightly is configured, eliminating the need for requirements_no_torch.txt
- All installation paths now use requirements.txt directly
- Replace USE_CI_COMMIT_PIN env var with UV_OVERRIDE pointing to
  requirements/ci/overrides.txt for Lightning commit pinning
- Add torch-nightly.txt as centralized config for nightly torch version
- Move optional dependencies to pyproject.toml (cli, extra, examples, ipynb)
- Add dependency-groups for dev and test in pyproject.toml
- Update minimum Python to 3.10 (drop 3.9)
- Add requirements-oldest.txt lock file for Python 3.10 oldest deps

The torch nightly workflow is now:
1. Edit torch-nightly.txt with version (or leave empty to disable)
2. Run lock_ci_requirements.sh to regenerate locks with torch filtered
3. CI pre-installs torch nightly from PyTorch index, then installs rest

This eliminates the complexity of runtime filtering and multiple
requirements files while maintaining flexibility for nightly testing.
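
Step 1 of the workflow above (a version in torch-nightly.txt enables nightly mode, an empty file disables it) could be read along these lines. The file format is an assumption for this sketch: the first line that is neither blank nor a `#` comment is treated as the version. The helper name `read_nightly_version` is hypothetical.

```python
from pathlib import Path


def read_nightly_version(config_path):
    """Return the configured nightly torch version, or None if disabled.

    Assumed format: the first non-blank, non-comment line holds the version.
    A missing or effectively empty file means nightly mode is off.
    """
    path = Path(config_path)
    if not path.exists():
        return None
    for raw in path.read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line:
            return line
    return None
```

A caller would then branch on the result: a version string selects the nightly lock path, while `None` selects the stable one.
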
@speediedan speediedan marked this pull request as ready for review December 4, 2025 19:39
Copilot AI review requested due to automatic review settings December 4, 2025 19:39
Copilot AI left a comment

Pull request overview

This PR successfully migrates the FTS build and CI infrastructure from pip to uv, providing faster and more reproducible builds. The migration is comprehensive, touching CI workflows, build scripts, dependency management, and supporting Python code.

Key Changes

  • UV adoption: Replaces pip with uv throughout CI and build scripts for 10-100x faster dependency resolution
  • Locked requirements: Introduces requirements/ci/requirements.txt and requirements-oldest.txt for reproducible builds across platforms
  • Static overrides: Replaces dynamic Lightning override generation with static requirements/ci/overrides.txt file
  • Minimum Python version bump: Changes from Python 3.9 to 3.10 as minimum supported version

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| .github/actions/install-ci-dependencies/action.yml | New composite action that centralizes UV-based dependency installation for CI |
| .github/workflows/ci_test-full.yml | Updated to use new composite action, removed manual pip caching logic |
| .github/workflows/code-checks.yml | Migrated type checking workflow to use UV and new composite action |
| .azure-pipelines/gpu-tests.yml | Updated Azure GPU tests to use UV and static override file |
| requirements/ci/overrides.txt | New static override file replacing dynamic generation |
| requirements/ci/requirements.txt | New locked requirements for highest resolution (latest tests) |
| requirements/ci/requirements-oldest.txt | New locked requirements for lowest resolution (oldest compatibility tests) |
| requirements/ci/torch-nightly.txt | Configuration file for optional PyTorch nightly builds |
| requirements/utils/lock_ci_requirements.sh | Script to generate locked requirements with UV |
| scripts/build_fts_env.sh | Completely rewritten to use UV, supports venv directory configuration |
| scripts/gen_fts_coverage.sh | Updated with logging improvements and UV support |
| scripts/infra_utils.sh | Added substantial utility functions for venv management and from-source installs |
| scripts/manage_standalone_processes.sh | New process management script with conflict detection |
| src/finetuning_scheduler/dynamic_versioning/utils.py | Simplified Lightning requirement handling (commit pinning now at install time) |
| src/finetuning_scheduler/fts_supporters.py | Bug fix for lr_lambdas synchronization edge case |
| src/finetuning_scheduler/strategy_adapters/base.py | Commented out lr_lambdas handling (related to bug fix) |
| setup.py | Streamlined with dependency management moved to pyproject.toml |
| pyproject.toml | Major update: moved to Python 3.10+, added optional dependencies, added dependency groups, pyright configuration |
| tests/test_dynamic_versioning_utils.py | Updated tests to reflect simplified commit pinning approach |
| tests/helpers/expected_warns.py | Added new expected warnings for PT 2.10 nightly compatibility |
| .github/copilot-instructions.md | New comprehensive repository documentation for AI assistants |
| dockers/base-cuda/Dockerfile | Updated to use UV for package installation |

Comments suppressed due to low confidence (2)

src/finetuning_scheduler/dynamic_versioning/utils.py:20

  • Import of 'tomllib' is not used.
    import tomllib

src/finetuning_scheduler/dynamic_versioning/utils.py:23

  • Import of 'tomllib' is not used.
        import tomli as tomllib  # type: ignore[import-not-found,no-redef]

@speediedan speediedan merged commit 9f178a1 into main Dec 6, 2025
15 checks passed