Optimize 1x32 Single Galaxy Topology Mapping to Reduce QSFP Usage #37547

@Riddy21

Description

Summary

The topology mapper currently generates suboptimal mappings for the 1x32 Single Galaxy topology. It uses QSFP torus connections aggressively when routing between nodes, assigning them too little cost relative to local links.

Context

  • Reporter: Ridvan Song
  • Impact: The resulting mappings overuse QSFP links (potentially at "every other node"), which increases latency and degrades performance for this topology.
  • Stakeholders: This issue is being tracked preemptively in case the Forge team (e.g., Uros Males, Vladimir Jovanovic) encounters performance bottlenecks with 1x32 topologies.

Requirements

  • Update the Topology Solver with link costs and constraints that penalize QSFP links in the 1x32 Single Galaxy topology.
  • The solver should prioritize local/lower-latency paths where possible.
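
One way to picture the requested costing is a per-link weight summed over the hops of a candidate placement. The sketch below is illustrative only and is not the Topology Solver's actual code: the 4x8 grid with wraparound links, the names `build_toy_torus`, `mapping_cost`, `snake_placement`, and `wrap_heavy_placement`, and the `LINK_COST` values are all assumptions made for the example, not the real Galaxy wiring or measured latencies.

```python
from typing import Dict, FrozenSet, List, Tuple

Coord = Tuple[int, int]   # (row, col) of a chip in the toy grid
Link = FrozenSet[Coord]   # undirected physical link between two chips

# Hypothetical relative costs; real values would need to come from measurements.
LINK_COST = {"local": 1.0, "qsfp": 8.0}


def build_toy_torus(rows: int, cols: int) -> Dict[Link, str]:
    """Toy model: local links between grid neighbours, QSFP wraparound links."""
    links: Dict[Link, str] = {}
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                links[frozenset({(r, c), (r, c + 1)})] = "local"
            if r + 1 < rows:
                links[frozenset({(r, c), (r + 1, c)})] = "local"
    # Wraparound links are only distinct from neighbour links when the
    # dimension is larger than 2.
    if cols > 2:
        for r in range(rows):
            links[frozenset({(r, 0), (r, cols - 1)})] = "qsfp"
    if rows > 2:
        for c in range(cols):
            links[frozenset({(0, c), (rows - 1, c)})] = "qsfp"
    return links


def mapping_cost(links: Dict[Link, str], placement: List[Coord]) -> Tuple[float, int]:
    """Return (total weighted cost, number of QSFP hops) for a 1xN placement."""
    total = 0.0
    qsfp_hops = 0
    for a, b in zip(placement, placement[1:]):
        kind = links.get(frozenset({a, b}))
        if kind is None:
            raise ValueError(f"no physical link between {a} and {b}")
        total += LINK_COST[kind]
        qsfp_hops += int(kind == "qsfp")
    return total, qsfp_hops


def snake_placement(rows: int, cols: int) -> List[Coord]:
    """Boustrophedon walk over the grid; uses only neighbour (local) links."""
    path: List[Coord] = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        path.extend((r, c) for c in cs)
    return path


def wrap_heavy_placement(rows: int, cols: int) -> List[Coord]:
    """Column-by-column walk that cycles through rows, crossing wraparound links."""
    path: List[Coord] = []
    start = 1
    for c in range(cols):
        path.extend(((start + i) % rows, c) for i in range(rows))
        start = (start - 1) % rows
    return path


links = build_toy_torus(4, 8)
snake = mapping_cost(links, snake_placement(4, 8))
wrap_heavy = mapping_cost(links, wrap_heavy_placement(4, 8))
print("snake:", snake, "wrap-heavy:", wrap_heavy)
```

Both placements are valid 1x32 lines on the toy grid, but with QSFP links weighted more heavily the snake placement scores lower, so a cost-minimizing solver would prefer it. A hard constraint (for example, a cap on the QSFP hop count returned by `mapping_cost`) could be layered on top of the weighted cost.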

Examples

  • Current Behavior: The mapper assigns QSFP torus connections frequently (e.g., at every other node) without accounting for their latency cost.
  • Desired Behavior: The mapper generates a mapping that minimizes QSFP usage, reserving QSFP links for necessary long-distance hops.
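
The difference between the two behaviors can be expressed as a simple acceptance check on the link type of each hop in a generated 1x32 mapping. This is a hypothetical sketch: the `"local"`/`"qsfp"` labels, the helper names `qsfp_fraction` and `within_budget`, and the 10% budget are assumptions for illustration, and the hop lists are synthetic rather than real mapper output.

```python
from typing import Sequence


def qsfp_fraction(hop_types: Sequence[str]) -> float:
    """Fraction of hops in a mapping that traverse a QSFP link."""
    if not hop_types:
        return 0.0
    return sum(1 for h in hop_types if h == "qsfp") / len(hop_types)


def within_budget(hop_types: Sequence[str], max_fraction: float) -> bool:
    """True if the mapping's QSFP usage stays at or below the given budget."""
    return qsfp_fraction(hop_types) <= max_fraction


# A 1x32 line has 31 hops between consecutive nodes.
# Synthetic "every other node" pattern versus an all-local mapping.
current = ["qsfp" if i % 2 else "local" for i in range(31)]
desired = ["local"] * 31

QSFP_BUDGET = 0.10  # hypothetical threshold, not a value from the issue

print("current ok:", within_budget(current, QSFP_BUDGET))
print("desired ok:", within_budget(desired, QSFP_BUDGET))
```

A check of this shape could serve as a regression test once the solver's costing is updated, with the budget chosen from the physical topology rather than the placeholder used here.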

Planning / Metadata

  • Repository: tenstorrent/tt-metal
  • Board: Control Plane TT-Distributed
  • Assignee: @Riddy21
  • Priority: P2
  • Labels: scale-out, topology-mapper, performance
  • Status: New Issue
