[Python] Support large pipeline options via file #37379
shunping merged 22 commits into apache:master
Summary of Changes (from Gemini Code Assist)
This pull request adds support for large pipeline options in the Python SDK. Previously, large options caused an 'argument list too long' error because of operating system limits on environment variable size. Instead of passing the options directly in an environment variable, the options are now written to a temporary file and only the file's path is passed along, so extensive configurations no longer fail at runtime. The change also makes the Python SDK consistent with the other SDKs.
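The file-based handoff described in the summary can be sketched roughly as follows. This is only an illustration in Python, not the actual bootloader code: the helper name stage_options_file and the sample options are hypothetical, and only the pipeline_options.json file name and the PIPELINE_OPTIONS_FILE variable come from the PR description.

```python
import json
import os
import tempfile


def stage_options_file(options: dict, directory: str) -> dict:
    """Write options as JSON under `directory` and return a child environment.

    Hypothetical helper: the environment carries only the file path,
    not the JSON payload itself.
    """
    path = os.path.join(directory, "pipeline_options.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(options, f)
    env = dict(os.environ)
    env["PIPELINE_OPTIONS_FILE"] = path
    return env


with tempfile.TemporaryDirectory() as tmp_dir:
    # Made-up options, large enough to show the size difference.
    big_options = {"options": {"labels": ["x" * 100] * 50_000}}
    env = stage_options_file(big_options, tmp_dir)
    payload_size = os.path.getsize(env["PIPELINE_OPTIONS_FILE"])
    print(f"JSON payload on disk: {payload_size} bytes")
    print(f"Environment value: {len(env['PIPELINE_OPTIONS_FILE'])} characters")
```

The environment value stays short no matter how large the options grow, which is what keeps the child process launch under the OS limit.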
|
All linting and integration checks are passing. Ready for review |
|
Assigning reviewers: R: @claudevdm for label python. The PR bot will only process comments in the main thread (not review comments). |
|
@shunping can you take a pass at this since you have more context? |
|
Ack! Will do |
shunping left a comment
Thanks for submitting the PR. Please check out my inline comments.
…ge with JAVA and GO SDKs
|
Hey @shunping. Just checking in on this one. Let me know if there's anything else you need from my end. |
Sorry for the delay. I would want to build a custom container based on this and test before approving it. Will try to do that tomorrow. |
|
/gemini review |
|
Could you merge to the latest master and push again? @Mathdee |
Code Review
This pull request introduces a valuable enhancement to support large pipeline options by passing them through a file, thus avoiding issues with argument length limits. The implementation in both Go and Python appears correct and follows the intended logic. I have a couple of suggestions to improve the Python code: one for exception handling to make it more robust, and another to add unit tests for the new functionality.
|
Both failed test workflows, PreCommit YAML Xlang Direct and PreCommit Python ML tests with ML deps installed, are unrelated to the change here. |
|
I have verified the changes with a new container and a test pipeline on Dataflow. The pipeline options were successfully loaded via the file and the job completed as expected. During testing, I found that the INFO log message for loading options from the file was being suppressed. This occurs because the default log level is WARNING until the pipeline options are fully parsed. Therefore, I have added a fallback to INFO at the start of |
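The log suppression described above can be reproduced with the standard logging module alone. The sketch below is an assumption-laden illustration, not the code added in this PR: the logger name is hypothetical, and the root level is set to WARNING explicitly so the demo matches the default state described in the comment.

```python
import logging

# Hypothetical logger name, used only for this demo.
logger = logging.getLogger("worker_startup_demo")

root = logging.getLogger()
# Start from WARNING, the standard library's default root level.
root.setLevel(logging.WARNING)
before = logger.isEnabledFor(logging.INFO)

# Fallback: lower the root level to INFO before any options are parsed.
root.setLevel(logging.INFO)
after = logger.isEnabledFor(logging.INFO)

print(f"INFO enabled before fallback: {before}")
print(f"INFO enabled after fallback: {after}")
```

A logger without its own level inherits the root level, so an INFO message emitted before the options are parsed is dropped unless the root level is lowered first.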
|
@Mathdee: Could you also update CHANGES.md to document this change? I suggest adding a note to the 'New Features/Improvements' section regarding the support for large pipeline options via a file. In the 'Breaking Changes' section, we should also mention that the Python SDK container's Please note that if a user pairs a new Python SDK container with an older SDK version (which does not support the file-based approach), the pipeline options will not be recognized and the pipeline will fail. In that case, we should advise users to ensure their SDK and container versions are synchronized. |
shunping left a comment
LGTM! Thanks a lot! Will merge when the tests are complete.
|
@shunping Awesome, thanks for taking the time to test this and for adding the log level fallback. |
|
Looks like PreCommit Whitespace failed. Could you fix the notes to include the issue link, i.e. #37370 (not the PR link)? The error message is shown below: |
|
I fixed the notes to include the issue link instead of the PR link, just waiting on the tests to be completed now. |
|
Verified that the failed workflow PreCommit Python Integration / beam_PreCommit_Python_Integration is unrelated as well. Good job! Thank you again, @Mathdee. |
|
Thanks so much for the review and all your help getting this across the finish line @shunping. So glad we could get it officially merged today. Have a great rest of your week! |

Issue:
As described in #37370, running pipelines with large options on Dataflow causes
fork/exec /usr/local/bin/python: argument list too long.
This occurs because the bootloader passes the complete JSON config via the
PIPELINE_OPTIONS environment variable, which exceeds the OS ARG_MAX limit.
The Fix:
This change applies the same pattern used in the Go SDK (Issue #27839, Commit e31e885) to Python.
… pipeline_options.json) and sets the PIPELINE_OPTIONS_FILE environment variable.
… PIPELINE_OPTIONS_FILE and loads the configs from disk if present.
Outcome:
Verified with unit tests that ensure file-based loading takes priority.
Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
Update CHANGES.md with noteworthy changes.
If this contribution is large, please file an Apache Individual Contributor License Agreement.
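The loading order described under "The Fix" and "Outcome" might look something like the sketch below. It is a minimal illustration under stated assumptions, not the SDK worker's actual code: the function name load_pipeline_options is hypothetical, and falling back to the PIPELINE_OPTIONS variable when no file is set is an assumption inferred from the "priority of file-based loading" wording.

```python
import json
import os
import tempfile


def load_pipeline_options(environ) -> dict:
    """Hypothetical loader: prefer the options file, else the inline variable."""
    path = environ.get("PIPELINE_OPTIONS_FILE")
    if path:
        # File-based loading takes priority when the variable is set.
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    # Assumed fallback: options passed inline as a JSON string.
    raw = environ.get("PIPELINE_OPTIONS")
    return json.loads(raw) if raw else {}


with tempfile.TemporaryDirectory() as tmp_dir:
    options_path = os.path.join(tmp_dir, "pipeline_options.json")
    with open(options_path, "w", encoding="utf-8") as f:
        json.dump({"source": "file"}, f)

    inline = '{"source": "env"}'
    both = {"PIPELINE_OPTIONS_FILE": options_path, "PIPELINE_OPTIONS": inline}
    from_file = load_pipeline_options(both)
    from_env = load_pipeline_options({"PIPELINE_OPTIONS": inline})
    print(from_file)
    print(from_env)
```

Passing the environment in as a mapping, rather than reading os.environ directly, keeps the priority rule easy to unit test.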