
[xla:pjrt:gpu] Pass network nodes to LocalTopologyProto #38009

Open
ezhulenev wants to merge 1 commit into openxla:main from ezhulenev:network-topology-1

ezhulenev (Contributor) commented Feb 18, 2026

Don't forget to pass network nodes to `LocalTopologyProto`, so that the coordinator process can use them for network-topology-optimized global device assignment.

JAX PR that allows keeping devices sorted by their assigned global device ID: jax-ml/jax#35178

ezhulenev requested a review from mwhittaker on February 18, 2026 19:26
copybara-service bot pushed a commit that referenced this pull request Feb 20, 2026
Imported from GitHub PR #38009

Don't forget to pass network nodes to local topology proto, so that the coordinator process can use them to do network-topology-optimized global device assignment.

JAX PR that allows keeping devices sorted by assigned global device id: jax-ml/jax#35178
Copybara import of the project:

--
e74b1a5 by Eugene Zhulenev <ezhulenev@openxla.org>:

[xla:pjrt:gpu] Pass network nodes to LocalTopologyProto

Merging this change closes #38009

FUTURE_COPYBARA_INTEGRATE_REVIEW=#38009 from ezhulenev:network-topology-1 e74b1a5
PiperOrigin-RevId: 872989879
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Feb 20, 2026, importing openxla/xla#38009 at commit e74b1a5e48fdca9cc8d9ee4d437416a380af9b7a (PiperOrigin-RevId: 872989879).
```diff
 if (options.sort_devices_by_process_index) {
-  absl::c_sort(devices, [](const std::unique_ptr<PjRtDevice>& a,
-                           const std::unique_ptr<PjRtDevice>& b) {
+  absl::c_stable_sort(devices, [](const std::unique_ptr<PjRtDevice>& a,
```
Contributor:

Thanks Eugene!

Could you elaborate on the context of this change? It is unclear to me whether stable sorting helps. IFRT requires all IFRT device IDs to be unique within a single IFRT client, and `PjRtDevice` here is `xla::ifrt::PjRtDevice`, not `xla::PjRtDevice`, so I think neither of the sorts below would see duplicate keys, and stable sorting is therefore unnecessary.

Contributor (Author):

You are right; I forgot what my motivation was last week :) Reverted it.

```cpp
VLOG(3) << "Global topology for platform " << platform << ":\n"
        << global_topology->DebugString();

// Because we might do global topology assignment based on network proximity
```
Contributor:

Nit: when a loop only does VLOG, consider using the following pattern so that the loop is skipped entirely when debug logging is not enabled:

```cpp
if (VLOG_IS_ON(3)) {
  VLOG(3) << ...;
  for (...) {
    VLOG(3) << ...;
  }
}
```

Contributor (Author):

Done

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Feb 23, 2026, importing openxla/xla#38009 at commit 4bcbb73be85e8c033bd3e580de2013a57bed704b (PiperOrigin-RevId: 872989879).
copybara-service bot pushed a commit that referenced this pull request Feb 23, 2026, importing #38009 at commit 4bcbb73 (PiperOrigin-RevId: 872989879).

3 participants