Integrate Data API into execution pipeline #66
Conversation
Code Review
This pull request integrates a new Data API to handle data dependencies for remote function execution, allowing Data objects to be passed as function arguments and used as volumes. While the implementation is thorough, a high-severity path traversal vulnerability was identified in the _download_data function in remote_runner.py, which could allow an attacker to overwrite arbitrary files on the remote execution pod. Additionally, there are correctness issues in the data download logic on the remote runner that could cause problems with certain GCS paths, and a minor refactoring is suggested for improved readability.
divyashreepathihalli left a comment:
Thank you for the PR - just reviewed remote_runner and packager for now
```python
# Volume-mounted data refs are handled by Kubernetes, skip download
if obj.get("mount_path") is not None:
    return obj
local_dir = os.path.join(DATA_DIR, str(counter))
```
Every time `_resolve` encounters a Data reference, it increments the counter and blindly initiates a fresh download. If a user passes the exact same dataset multiple times (e.g., `train(dataset=Data("./cache"), eval=Data("./cache"))`), the remote runner will download the entire, potentially gigabyte-sized bucket multiple times, once to `/tmp/data/0` and again to `/tmp/data/1`. We should wrap this in a `download_cache = {}` dictionary keyed by `obj["gcs_uri"]` at the top of `resolve_data_refs`.
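A minimal sketch of that cache, assuming (from this thread) that refs carry a `gcs_uri` key and that `_download_data` is the existing download helper; the stub below stands in for the real GCS download, so this is illustrative rather than the PR's actual code:

```python
DATA_DIR = "/tmp/data"
downloads = []  # records what the stub "downloads", for illustration

def _download_data(ref, target_dir, storage_client):
    # stub for the PR's actual GCS download helper
    downloads.append(ref["gcs_uri"])

def resolve_data_refs(args, kwargs, storage_client):
    """Resolve __data_ref__ dicts, downloading each distinct GCS URI once."""
    download_cache = {}  # gcs_uri -> local dir, shared across the whole call
    counter = [0]

    def _resolve(obj):
        if isinstance(obj, dict) and obj.get("__data_ref__"):
            uri = obj["gcs_uri"]
            if uri not in download_cache:
                local_dir = f"{DATA_DIR}/{counter[0]}"
                counter[0] += 1
                _download_data(obj, local_dir, storage_client)
                download_cache[uri] = local_dir
            return download_cache[uri]
        if isinstance(obj, (list, tuple)):
            return [_resolve(v) for v in obj]
        if isinstance(obj, dict):
            return {k: _resolve(v) for k, v in obj.items()}
        return obj

    return _resolve(list(args)), _resolve(kwargs)
```

With this, passing the same dataset twice triggers a single download instead of two.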
Thanks for catching this, but since this doesn't affect correctness, we can tackle it separately in another PR for the sake of simplicity. Created #69 for tracking.
```python
blobs = bucket.list_blobs(prefix=prefix + "/")
count = 0
for blob in blobs:
```
This loop executes synchronously. Downloading a framework dataset consisting of 100,000 tiny .png or audio files will take a long time and may even exceed the pod start timeout because of per-request GCS HTTP latency. Maybe wrap the `download_to_filename` call in a `concurrent.futures.ThreadPoolExecutor` to execute batches of downloads concurrently?
Makes sense, but let's handle this in a separate change since it doesn't affect correctness. It will keep this PR's size manageable. Created #70 for tracking.
not sure why the rest of the checks are not running
Wires Data objects through packager, core API, execution backend, and remote runner. Supports Data as function args (including nested in lists/dicts), volumes with fixed mount paths, and excludes data paths from working directory zips.
Integrate Data API into execution pipeline
Depends on #65
Summary
- Integrates the `Data` dataclass into the full execution pipeline: from the `@run()` decorator through artifact preparation, serialization, and pod-side resolution
- New `volumes` parameter to `@keras_remote.run()` for mounting data at fixed paths on the remote pod
- `Data` objects as function arguments (including nested in lists/tuples/dicts), which are transparently resolved to filesystem paths on the pod

New volumes parameter on @keras_remote.run()

The `run()` decorator now accepts a `volumes` dict mapping absolute mount paths to `Data` objects.

Validation:
- `volumes` must be a `dict` (raises `TypeError` otherwise)
- keys must be absolute mount paths starting with `/` (raises `ValueError` otherwise)
- values must be `Data` instances (raises `TypeError` otherwise)

Two ways to pass data to remote functions
Method 1: Data as function arguments
`Data` objects passed as positional or keyword arguments are transparently replaced with local filesystem paths on the remote pod, so the user's function receives plain strings. `Data` objects can be nested inside lists, tuples, and dicts at any depth.
Method 2: Volumes (fixed-path mounts)
Data is downloaded to a specific absolute path on the pod before function execution.
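A sketch of the pod-side volume step, assuming volumes serialize to a mapping of mount path to data ref; the real `resolve_volumes` takes a GCS `storage_client`, for which a plain download callback stands in here:

```python
import os

def resolve_volumes(volume_refs, download_fn):
    """Download each volume's data to its declared absolute mount path."""
    for mount_path, ref in volume_refs.items():
        os.makedirs(mount_path, exist_ok=True)   # e.g. /mnt/datasets/train
        download_fn(ref, mount_path)             # real code pulls from GCS
```

After this step the mount path exists and is populated before the user's function runs.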
Mixed usage
Both methods can be combined.
End-to-end data flow
1. User creates `Data(...)` and passes them as function args or in `volumes`
2. Artifact preparation (`_prepare_artifacts` in `execution.py`): uploads each `Data` object to `{namespace}/data-cache/{hash}/` with content-addressed caching, and replaces `Data` objects with serializable `__data_ref__` dicts in args/kwargs
3. The payload carries the `volumes` list and the data-ref-replaced args/kwargs
4. On the pod (`remote_runner.py`): `resolve_volumes()` downloads volume data to their fixed mount paths, and `resolve_data_refs()` recursively resolves `__data_ref__` dicts in args/kwargs to local paths
5. The user's function receives plain `str` paths, completely unaware of the `Data` abstraction

Execution pipeline changes
- `JobContext` dataclass: new field
- `_prepare_artifacts()` now:
  - uploads `Data` via `upload_data()` and creates data ref dicts
  - extracts `Data` objects via `extract_data_refs()`
  - replaces `Data` objects in args/kwargs with ref dicts via `replace_data_with_refs()`
  - passes `exclude_paths` to `zip_working_dir()` and `volumes` to `save_payload()`

Packager changes
New functions:
- `extract_data_refs(args, kwargs)`: recursively scans for `Data` objects at any nesting depth, returning `(data_obj, position_path)` tuples
- `replace_data_with_refs(args, kwargs, ref_map)`: recursively replaces `Data` objects with their serializable ref dicts

Updated functions:
- `zip_working_dir()`: now accepts an `exclude_paths` set to skip Data-referenced paths
- `save_payload()`: now accepts an optional `volumes` list, included in the serialized payload

Remote runner changes
New functions on the pod side:
- `resolve_volumes(volume_refs, storage_client)`: downloads each volume's data from GCS to its declared mount path
- `resolve_data_refs(args, kwargs, storage_client)`: recursively walks args/kwargs, downloads data from GCS to `DATA_DIR/{counter}/`, and replaces ref dicts with local paths
- `_download_data(ref, target_dir, storage_client)`: downloads blobs from a GCS URI prefix, skipping `.cache_marker` sentinels and directory markers