Improve logging logic to improve/fix GPU performance #2252

Open
@strickvl

Description

Open Source Contributors Welcomed!

Please comment below if you would like to work on this issue!

What happened?

Users have reported that GPU utilization dropped from 95% to 2% after upgrading ZenML from version 0.32.1 to 0.44.2. The drop was observed while deploying pipelines on GCP Vertex AI. Investigation suggests that the bottleneck is the logging mechanism: frequent updates from progress bars such as tqdm appear to slow processing substantially.
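To make the suspected mechanism concrete, here is a minimal sketch using only the Python standard library. It assumes (this is not confirmed in the report) that output written by a progress bar passes through some wrapper that does work on every write. `CountingStream` and `run_loop` are hypothetical names for illustration, not ZenML or tqdm APIs.

```python
import io
import time


class CountingStream(io.TextIOBase):
    """Hypothetical stand-in for an output wrapper that does work on every write."""

    def __init__(self, per_write_cost_s: float = 0.0) -> None:
        super().__init__()
        self.writes = 0
        self.per_write_cost_s = per_write_cost_s

    def writable(self) -> bool:
        return True

    def write(self, text: str) -> int:
        self.writes += 1
        if self.per_write_cost_s:
            # Simulated cost of handling one write (assumption, e.g. a flush).
            time.sleep(self.per_write_cost_s)
        return len(text)


def run_loop(stream: CountingStream, steps: int, update_every: int) -> int:
    """Simulate a loop that redraws a progress line every `update_every` steps."""
    for step in range(1, steps + 1):
        if step % update_every == 0:
            stream.write(f"\r{step}/{steps}")
    return stream.writes


# Redrawing on every step versus redrawing every 500 steps.
chatty = run_loop(CountingStream(), steps=10_000, update_every=1)
throttled = run_loop(CountingStream(), steps=10_000, update_every=500)
print(chatty, throttled)
```

If each write carries even a small fixed cost, the per-step redraw pays it 10,000 times while the throttled loop pays it 20 times. Whether this is what actually happens inside ZenML is exactly what the investigation would need to establish.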

Task Description

Investigate and optimize the logging logic in ZenML, particularly for scenarios involving high GPU usage. The goal is to ensure that the logging process, including progress bars, does not adversely affect the GPU performance and overall speed of pipeline execution.

Expected Outcome

  • ZenML should maintain high GPU utilization without being impacted by the logging process.
  • Users should be able to use progress bars and other logging tools without experiencing a significant slowdown in processing.
  • Modifications should be made to allow users to control the frequency and verbosity of logs to balance between logging needs and performance.
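One possible shape for the frequency control in the last point is a rate-limiting filter built on the standard `logging` module. This is a sketch under assumptions, not ZenML's implementation: `RateLimitFilter` and `ListHandler` are hypothetical names, and the injected clock exists only to make the example deterministic.

```python
import logging
import time


class RateLimitFilter(logging.Filter):
    """Drop records that arrive less than `min_interval_s` after the last passed one."""

    def __init__(self, min_interval_s: float, clock=time.monotonic) -> None:
        super().__init__()
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._last_emit = None

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # never throttle warnings and errors
        now = self._clock()
        if self._last_emit is not None and now - self._last_emit < self.min_interval_s:
            return False
        self._last_emit = now
        return True


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so the effect is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


# Demo with a fake clock: ten INFO records arrive 0.25 s apart.
fake_time = [0.0]
logger = logging.getLogger("throttle_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
handler.addFilter(RateLimitFilter(min_interval_s=1.0, clock=lambda: fake_time[0]))
logger.addHandler(handler)

for i in range(10):
    fake_time[0] = i * 0.25
    logger.info("step %d", i)
logger.warning("loss is NaN")

print(handler.messages)
```

With a one-second interval, only the records at 0.0 s, 1.0 s and 2.0 s pass, plus the warning, which bypasses throttling by design. Attaching the filter to a handler rather than to a logger means it applies to every record that reaches that handler, whichever logger produced it.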

Steps to Implement

  • Analyze the current logging mechanism and identify how it interacts with GPU-intensive processes.
  • Develop solution(s) to optimize logging, particularly when progress bars are used, to reduce their impact on GPU and overall performance.
  • Implement configurable settings for users to control the logging behavior, such as limiting log frequency or verbosity.
  • Thoroughly test the changes in scenarios with high GPU usage to ensure that the logging optimizations are effective.
  • Update documentation to guide users on how to configure logging settings for optimal performance.

Note that part of the solution might be to better expose the global variables / constants that control logging behavior as settings, for example via environment variables.
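As a rough illustration of the environment-variable idea, the sketch below reads two settings with defaults and validation. The variable names `EXAMPLE_LOGGING_VERBOSITY` and `EXAMPLE_LOG_MIN_INTERVAL_S`, the `LoggingSettings` class and `load_logging_settings` are all hypothetical, chosen for this example; they are not existing ZenML settings.

```python
import logging
import os
from dataclasses import dataclass
from typing import Mapping

# Accepted verbosity names mapped to standard logging levels.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


@dataclass(frozen=True)
class LoggingSettings:
    verbosity: int          # a standard `logging` level
    min_interval_s: float   # minimum seconds between non-critical log lines


def load_logging_settings(env: Mapping[str, str] = os.environ) -> LoggingSettings:
    """Build settings from environment variables (hypothetical names)."""
    level_name = env.get("EXAMPLE_LOGGING_VERBOSITY", "INFO").upper()
    if level_name not in _LEVELS:
        raise ValueError(f"Unknown verbosity: {level_name!r}")
    interval = float(env.get("EXAMPLE_LOG_MIN_INTERVAL_S", "0"))
    if interval < 0:
        raise ValueError("Minimum log interval must be non-negative")
    return LoggingSettings(verbosity=_LEVELS[level_name], min_interval_s=interval)


# Passing a plain dict instead of os.environ keeps the example reproducible.
settings = load_logging_settings(
    {"EXAMPLE_LOGGING_VERBOSITY": "warning", "EXAMPLE_LOG_MIN_INTERVAL_S": "2.5"}
)
defaults = load_logging_settings({})
print(settings, defaults)
```

Reading the values in one place, with explicit defaults, would also give the documentation a single list of knobs to describe.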

Additional Context

This issue is critical for users leveraging ZenML for GPU-intensive tasks, as efficient GPU utilization is key to performance in these scenarios. The solution should provide a balance between informative logging and optimal resource utilization.

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

  • bug (Something isn't working)
  • good first issue (Good for newcomers)
