
perf(core): Explicitly set PyTorch intra-op threads.#21

Merged
Leahlijuan merged 19 commits into main from feature/torchthreads
Feb 9, 2026
Conversation

@Leahlijuan
Collaborator

Set the torch number of threads to max(1, num_cpus // 2 // num_ranks // thread_count). This helps improve writing performance and also resolves the runtime error that occurs when tensor.copy_() is called from multiple write threads.
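For concreteness, the formula from the description can be sketched as a small helper. Function and parameter names here are illustrative, not the PR's actual code; the real change reads num_cpus from os.cpu_count() and num_ranks from the number of CUDA devices:

```python
def compute_torch_thread_count(num_cpus: int, num_ranks: int, thread_count: int) -> int:
    """Split roughly half the CPU cores evenly across all writer threads on all ranks."""
    num_cpus = num_cpus or 1            # os.cpu_count() can return None
    num_ranks = max(num_ranks, 1)       # avoid dividing by zero on CPU-only hosts
    thread_count = max(thread_count, 1)
    return max(1, num_cpus // 2 // num_ranks // thread_count)

# Example: a 96-core host, 8 ranks, 2 write threads per rank
# -> 96 // 2 // 8 // 2 = 3 intra-op threads per writer
print(compute_torch_thread_count(96, 8, 2))
```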
@gemini-code-assist
Contributor

Summary of Changes

Hello @Leahlijuan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical enhancement to how PyTorch's intra-operation parallelism is managed during data writing operations. By explicitly setting the number of PyTorch threads based on system resources and the number of concurrent writers, the change aims to significantly improve writing performance and resolve specific runtime errors that can occur when tensor.copy_() is invoked by multiple threads simultaneously. The implementation also ensures robust resource management by restoring the original PyTorch thread configuration upon completion.

Highlights

  • PyTorch Intra-Op Thread Management: Explicitly sets the number of PyTorch intra-operation threads within the write_data function to optimize performance and prevent runtime errors when using tensor.copy_() with multiple write threads.
  • Dynamic Thread Calculation: The PyTorch thread count is dynamically calculated based on the number of CPU cores, CUDA device ranks, and the specified thread_count for the write operation, using the formula max(1, num_cpus // 2 // num_ranks // thread_count).
  • Resource Cleanup and Safety: The original PyTorch thread count is saved and then restored in a try...finally block, ensuring that the global PyTorch thread settings are reverted after the write_data operation completes, regardless of success or failure.
  • Enhanced Logging: Added detailed debug logging to show the original, calculated, and applied PyTorch thread counts, along with CPU and rank information.


@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces a change to explicitly set the number of intra-op threads for PyTorch during checkpoint writing. This is a good optimization to improve performance and prevent potential race conditions with multiple writer threads. The use of a try...finally block to restore the original thread count is also a good practice.

I've found one critical issue that could lead to a ZeroDivisionError on CPU-only machines. Please see my detailed comment.
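The hazard flagged here can be reproduced in isolation: if get_accelerator_count() returns 0 on a CPU-only machine, the unguarded formula divides by zero. Clamping each divisor to at least 1 (as the merged snippet below does) avoids it. A minimal illustration with hypothetical helper names:

```python
def unsafe_thread_count(num_cpus, num_ranks, thread_count):
    # No guard: raises ZeroDivisionError when num_ranks == 0
    return max(1, num_cpus // 2 // num_ranks // thread_count)

def safe_thread_count(num_cpus, num_ranks, thread_count):
    # Clamp divisors to >= 1 before dividing
    return max(1, num_cpus // 2 // max(num_ranks, 1) // max(thread_count, 1))

try:
    unsafe_thread_count(16, 0, 4)   # num_ranks == 0 on a CPU-only host
except ZeroDivisionError:
    print("unguarded formula raises ZeroDivisionError")

print(safe_thread_count(16, 0, 4))  # -> 2
```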

@Leahlijuan Leahlijuan self-assigned this Feb 2, 2026
@Leahlijuan Leahlijuan requested review from g-husam and kkkapu February 3, 2026 16:44
Collaborator

@g-husam g-husam left a comment


left some minor comments

g-husam and others added 7 commits February 4, 2026 15:53
Clarifying where the improved read/write speedups were exactly. Also
adding note on infra-agnosticism, and removing the WIP label from the
doc site.
#19)

- Extracts context recovery logic from
DefaultMLFlashpointCheckpointLoader

- Introduces _get_extra_local_objects and _get_extra_needed_objects to
DefaultMLFlashpointCheckpointLoader

- Updates NeMo wrapper to instantiate the new
NeMoMLFlashpointCheckpointLoader

---------

Co-authored-by: g-husam <husameldawi@google.com>
@Leahlijuan Leahlijuan requested a review from g-husam February 5, 2026 21:57
g-husam
g-husam previously approved these changes Feb 5, 2026
Collaborator

@g-husam g-husam left a comment


small tweak to get_accelerator_count and comment placing and good to go

thread_count = max(thread_count, 1)
num_cpus = os.cpu_count() or 1
num_ranks = max(get_accelerator_count(), 1)
torch_thread_count = max(1, num_cpus // 2 // num_ranks // thread_count)
Collaborator
as a future improvement, we can make the percentage of CPUs used configurable. here we are assuming 50% usage (by dividing by 2), but some scenarios may desire using more or less.
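The suggested future improvement could look like the following sketch. The cpu_fraction parameter and the function name are hypothetical, not part of this PR; the merged code hard-codes the 50% budget via the `// 2`:

```python
def torch_threads_for_writers(num_cpus, num_ranks, thread_count, cpu_fraction=0.5):
    """Like the merged formula, but with the CPU budget made configurable."""
    budget = int(num_cpus * cpu_fraction)   # cores allotted to checkpoint writing
    return max(1, budget // max(num_ranks, 1) // max(thread_count, 1))

print(torch_threads_for_writers(96, 8, 2))                     # default 50% -> 3
print(torch_threads_for_writers(96, 8, 2, cpu_fraction=1.0))   # all cores   -> 6
```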

@g-husam
Collaborator

g-husam commented Feb 5, 2026

small tweak to get_accelerator_count and comment placing and good to go

actually one more important thing - we should add a test that catches the failure this is fixing. So if we comment out this fix, that test should fail, and with it, it should pass. I think the failure was whenever we use multiple write threads or something?

@g-husam g-husam self-requested a review February 5, 2026 22:10
@g-husam g-husam dismissed their stale review February 5, 2026 22:10

confirm test that catches the failure this is fixing

@Leahlijuan Leahlijuan closed this Feb 5, 2026
@g-husam g-husam changed the title feat(core): Explicitly set PyTorch intra-op threads. perf(core): Explicitly set PyTorch intra-op threads. Feb 5, 2026
@Leahlijuan
Collaborator Author

small tweak to get_accelerator_count and comment placing and good to go

actually one more important thing - we should add a test that catches the failure this is fixing. So if we comment out this fix, that test should fail, and with it, it should pass. I think the failure was whenever we use multiple write threads or something?

This can't be caught through unit tests. The error only happens when running actual training.

@Leahlijuan Leahlijuan reopened this Feb 5, 2026
@Leahlijuan Leahlijuan requested a review from g-husam February 9, 2026 17:17
@Leahlijuan Leahlijuan enabled auto-merge (squash) February 9, 2026 18:11
@Leahlijuan Leahlijuan merged commit d64060d into main Feb 9, 2026
5 checks passed
@Leahlijuan Leahlijuan deleted the feature/torchthreads branch February 9, 2026 18:18