
This CL introduces several enhancements to the dot product scheduling and parallelism logic: #9916

Open

copybara-service[bot] wants to merge 1 commit into master from test_896309637


copybara-service bot commented Apr 8, 2026

This CL introduces several enhancements to the dot product scheduling and parallelism logic:

Cache-Aware Scheduling:
The `schedule_dot` function now uses `cpu_info` (L1, L2, and L3 cache sizes, plus L3 sharing) to make more informed tiling decisions. Additions:

  • A fast path for small matrices that fit entirely within the L2 cache, skipping K-tiling.
  • Outer k-loop tiling sized to fit within the L2 cache. The smaller of matrix A or B is kept cache-resident.
  • If both A and B are contiguous, we make use of L3 and effective prefetching to hide load latency.
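The first two bullets can be illustrated with a minimal sketch. It is not the code from this CL: `CpuInfo`, `choose_k_tile`, `k_align`, and `l2_fraction` are hypothetical names and constants chosen for illustration, and the real `schedule_dot` may weigh L1, L3, and L3 sharing in ways this sketch does not model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuInfo:
    """Hypothetical stand-in for the cache fields of cpu_info."""
    l1_bytes: int
    l2_bytes: int
    l3_bytes: int


def choose_k_tile(m, n, k, elem_a, elem_b, cpu, k_align=8, l2_fraction=0.5):
    """Return a K tile size, or k itself when no K-tiling is needed.

    A is m x k and B is k x n. The l2_fraction budget and k_align are
    assumptions for this sketch, not values taken from the CL.
    """
    if min(m, n, k) <= 0:
        return k  # Degenerate shapes: nothing to tile.

    budget = int(cpu.l2_bytes * l2_fraction)
    a_bytes = m * k * elem_a
    b_bytes = k * n * elem_b

    # Fast path: both operands fit in the L2 budget, so skip K-tiling.
    if a_bytes + b_bytes <= budget:
        return k

    # Otherwise size the K tile so a slice of the smaller operand stays
    # resident in L2. Bytes of that operand per unit of K:
    bytes_per_k = min(m * elem_a, n * elem_b)
    k_tile = budget // bytes_per_k
    # Round down to a multiple of k_align, but never below one aligned step.
    k_tile = max(k_align, (k_tile // k_align) * k_align)
    return min(k, k_tile)
```

With a 1 MiB L2 and `float32` operands, a 64 x 64 x 64 problem takes the fast path and returns `k` unchanged, while a 256 x 4096 x 4096 problem gets a K tile that keeps a slice of A (the smaller operand) within the L2 budget.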

Dynamic Tiling for Parallelism:
`choose_split_factors` now dynamically determines the 2D tiling (`m_split`, `n_split`) for parallel execution. Additions:

  • Inclusion of element sizes (elem_a, elem_b, elem_c) for more accurate footprint calculations.
  • A fast path for very small workloads to run on a single thread.
  • Handling of asymmetric matrix shapes (M-heavy or N-heavy) through aspect ratios, and larger target footprints to avoid inefficient sliver-shaped tiles.
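As a rough illustration of the ideas in this list, the sketch below picks a 2D split from the problem shape. It is an assumption-laden approximation, not the CL's algorithm: the signature, the `min_parallel_bytes` threshold, and the "closest to square tile" search are all hypothetical, and the CL's target-footprint logic is not modeled.

```python
import math


def choose_split_factors(m, n, k, elem_a, elem_b, elem_c, num_threads,
                         min_parallel_bytes=64 * 1024):
    """Return (m_split, n_split), the number of tiles along M and N.

    The threshold and the ratio-based search are assumptions made for
    illustration; the CL does not list its constants.
    """
    # Footprint in bytes, using the per-operand element sizes.
    footprint = m * k * elem_a + k * n * elem_b + m * n * elem_c

    # Fast path: very small workloads stay on a single thread.
    if num_threads <= 1 or footprint <= min_parallel_bytes:
        return 1, 1

    # Among the factorizations of num_threads, pick the one whose tiles
    # are closest to square, which avoids thin slivers on M-heavy or
    # N-heavy shapes.
    best = (1, 1)
    best_cost = math.inf
    for m_split in range(1, num_threads + 1):
        if num_threads % m_split:
            continue
        n_split = num_threads // m_split
        if m_split > m or n_split > n:
            continue  # Would create empty tiles.
        tile_m = m / m_split
        tile_n = n / n_split
        cost = abs(math.log(tile_m / tile_n))
        if cost < best_cost:
            best, best_cost = (m_split, n_split), cost
    return best
```

Under these assumptions, an M-heavy 8192 x 64 output on 8 threads is split only along M, an N-heavy 64 x 8192 output only along N, and a tiny 8 x 8 x 8 problem is not split at all.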

PiperOrigin-RevId: 896309637