Include VertexAI cluster environment for Fabric #19911

Open. miguelalba96 wants to merge 11 commits into master.

@miguelalba96 commented May 27, 2024

VertexAI cluster environment for Fabric

It adds a cluster environment subclass that reads the CLUSTER_SPEC of a Vertex AI custom training job and populates the environment variables that DDP needs.

from lightning.fabric.plugins.environments.vertexai import VertexAIEnvironment 

Dependencies: os, json
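
As a rough illustration of the idea (not the code in this PR), the sketch below parses a CLUSTER_SPEC string and derives DDP-style values from it. It assumes the JSON layout described in the Vertex AI distributed training documentation (a `cluster` mapping of worker pool names to `host:port` lists, plus a `task` entry with `type` and `index`). The helper name `ddp_env_from_cluster_spec`, the returned keys, the example host names, and the choice to count every worker pool as a training node are all assumptions made for this example.

```python
import json


def ddp_env_from_cluster_spec(cluster_spec_json: str) -> dict:
    """Hypothetical helper: map a CLUSTER_SPEC JSON string to DDP-style values."""
    spec = json.loads(cluster_spec_json)
    cluster = spec["cluster"]  # e.g. {"workerpool0": ["host:port"], ...}
    task = spec["task"]        # e.g. {"type": "workerpool1", "index": 0}

    # Order pools by name so that workerpool0 (the primary replica) comes first.
    pools = sorted(cluster)

    # Assumption: the first host of workerpool0 acts as the rendezvous address.
    master_addr, master_port = cluster["workerpool0"][0].rsplit(":", 1)

    # Node rank = hosts in all earlier pools + this task's index in its own pool.
    pool_position = pools.index(task["type"])
    offset = sum(len(cluster[name]) for name in pools[:pool_position])
    node_rank = offset + int(task["index"])

    # Assumption: every listed host is a training node.
    num_nodes = sum(len(hosts) for hosts in cluster.values())

    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": master_port,
        "NODE_RANK": str(node_rank),
        "NUM_NODES": str(num_nodes),
    }


# Made-up spec: one primary node and two additional workers.
example_spec = json.dumps(
    {
        "cluster": {
            "workerpool0": ["host-a:2222"],
            "workerpool1": ["host-b:2222", "host-c:2222"],
        },
        "environment": "cloud",
        "task": {"type": "workerpool1", "index": 1},
    }
)
env = ddp_env_from_cluster_spec(example_spec)
print(env)
```

In a real job the string would come from `os.environ["CLUSTER_SPEC"]`, and the resulting values could be written back to `os.environ` before the process group is initialized.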

Before submitting

  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)

I added documentation in the docstring; I am not sure where else to add docs.

  • Did you write any new necessary tests? (not for typos and docs)

  • [x] Did you verify new and existing tests pass locally with your changes?

I tested the code by running it in Vertex AI Pipelines as a custom training job. To assess the effect of the change, you can check the documentation.

I tested the custom job using

google_cloud_pipeline_components.v1.custom_job.create_custom_training_job_from_component

from Google's official documentation.

There are no substantial changes to the source code beyond adding the environment to some imports.

  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

yes

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--19911.org.readthedocs.build/en/19911/

The following cluster environment makes Fabric aware of the cluster specification defined for a custom training job in Vertex AI

More information about the environment variables in:
https://cloud.google.com/vertex-ai/docs/training/distributed-training
added .py extension
rename VertexAIEnvironment for consistency with other environments
Added VertexAIEnvironment
Added VertexAIEnvironment to connectors
include VertexAIEnvironment in list of environments in lightning.pytorch
@github-actions bot added the labels docs (Documentation related), fabric (lightning.fabric.Fabric), and pl (Generic label for PyTorch Lightning package) May 27, 2024
stale bot commented Apr 26, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://lightning.ai/docs/pytorch/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Discord. Thank you for your contributions.

@stale bot added the label won't fix (This will not be worked on) Apr 26, 2025
Labels: docs (Documentation related), fabric (lightning.fabric.Fabric), has conflicts, pl (Generic label for PyTorch Lightning package), won't fix (This will not be worked on)