tt-metal main branch compatibility fixes #1293

shutovilyaep · 2025-11-19T19:03:01Z

Build System Modernization & PyPI Readiness

Summary

Modernizes the build system to eliminate manual environment setup and enable PyPI distribution. The extension now works without LD_LIBRARY_PATH or other environment variables. Note: running tests against development builds still requires setting the TT_METAL_HOME environment variable because of a bug in tt-metal's current Python path discovery for source builds; this is not needed for PyPI wheel installations, where runtime assets are bundled. CI validates wheel creation and tests installed wheels to emulate the PyPI user experience. This PR also fixes compatibility with tt-metal's current main branch (Nov 2025) and addresses project freeze requirements, so the project keeps building without manual code changes.

Key Changes

1. RPATH Configuration and Binary Bundling

Problem: Extension couldn't find PyTorch/TT-Metal libraries at runtime.

Fix: Configure RPATH and bundle ttnn binaries via CMake during wheel build.

- BUILD_RPATH: $ORIGIN:${PYTORCH_LIB_DIR}:${TT_METAL_LIB_DIR}
- INSTALL_RPATH: $ORIGIN:$ORIGIN/../torch/lib
- TT-Metal libraries (libtt_metal.so, libtt_stl.so, libdevice.so, libtracy.so, _ttnncpp.so) are bundled into the wheel during CMake installation
- Bundled libraries are placed in the torch_ttnn_cpp_extension/ directory alongside the extension module
- Setting RPATH to $ORIGIN lets the extension find bundled libraries in its own directory

Result: No LD_LIBRARY_PATH needed. Bundling simplifies PyPI wheel creation by including required binaries directly in the wheel.
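
To illustrate how the INSTALL_RPATH entries above resolve at load time, here is a minimal Python sketch of the expansion the dynamic linker performs: $ORIGIN is replaced with the directory that contains the loaded shared object. The function name `expand_rpath` and the example install path are hypothetical and used only for illustration; this is not code from the PR.

```python
import os
from typing import List


def expand_rpath(rpath: str, object_path: str) -> List[str]:
    """Return the directories an RPATH string points at for a given shared object.

    Mimics the linker's $ORIGIN substitution: $ORIGIN becomes the directory
    that contains the object being loaded.
    """
    origin = os.path.dirname(os.path.abspath(object_path))
    search_dirs = []
    for entry in rpath.split(":"):
        if not entry:
            continue  # skip empty entries
        expanded = entry.replace("${ORIGIN}", origin).replace("$ORIGIN", origin)
        search_dirs.append(os.path.normpath(expanded))
    return search_dirs


# Hypothetical install location of the extension module inside site-packages.
example_dirs = expand_rpath(
    "$ORIGIN:$ORIGIN/../torch/lib",
    "/site-packages/torch_ttnn_cpp_extension/ext.so",
)
```

Under these assumptions the first entry resolves to the extension's own directory (where the bundled TT-Metal libraries live) and the second to the sibling torch/lib directory, which is why no LD_LIBRARY_PATH is required.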

2. Dependency Management

Problem: pip install -e .[dev] replaced locally built ttnn with old PyPI version, causing API mismatches.

Fix: Moved ttnn to optional [pypi] extra (Python packaging standard).

# dependencies list no longer includes ttnn

[project.optional-dependencies]
pypi = ["ttnn @ <direct-url-to-wheel>"]  # Auto-updated by CI

Usage:

Dev: pip install -e .[dev] (uses local ttnn)
PyPI: pip install torch-ttnn[pypi] (downloads ttnn)

3. CI Wheel Testing (PyPI User Simulation)

CI now validates wheel creation and tests installed wheels to ensure PyPI compatibility.

Process:

1. Run tests against the development build (using the tt-metal submodule)
2. Build the wheel from source
3. Uninstall development packages (torch-ttnn and ttnn) to simulate a clean PyPI environment
4. Install the wheel with the [pypi] extra (downloads ttnn from PyPI)
5. Run the full test suite against the PyPI-emulated package installation

Implementation:

build-test-release-wheel.yaml: Builds wheel, uninstalls dev installation, installs wheel with [pypi] extra, verifies installation
run-cpp-native-tests.yaml: Runs tests against development build, then builds wheel, uninstalls dev packages, and runs tests against PyPI-emulated installation
before_merge.yaml: Added build-wheel-check job that verifies wheel builds successfully before merge
Test jobs: test-wheel-smoke, test-wheel-lowering, test-wheel-model run comprehensive tests against installed wheels

This ensures wheels work correctly for PyPI users before release.

4. Consistent Compiler

Problem: scikit-build-core used GCC 11 instead of Clang-17.

Fix: Forced Clang-17 in pyproject.toml, removed hardcoded ABI flags (now auto-detected).

5. Source Code Updates

Code changes required due to tt-metal API changes and PyTorch 2.7.1 bump:

device.cpp: Updated header for tt-metal v0.64 compatibility
extension_utils.hpp: Added __FILE_NAME__ fallback for GCC 11 compatibility
ttnn_device_mode.py: Registered module as torch.ttnn (fixes import torch.ttnn due to PyTorch 2.7.1 change)
conftest.py: Updated to mesh_device API (tt-metal API change)

6. CI Infrastructure Updates

Docker images: Migrated CI workflows to use tt-metal Docker images instead of building pytorch2.0_ttnn-specific images
- Deprecated building torch-ttnn Docker images (TODOs left in workflows for future reference)
- Due to project freeze, pinned tt-metal Docker images to specific digests for reproducibility
- Ensures consistent build environment and reduces maintenance overhead
run-cpp-native-tests.yaml:
- Removed LD_LIBRARY_PATH exports
- Added pyproject.toml to trigger paths (enables CI on ttnn updates)
- Smart ttnn handling: uses [pypi,dev] for ttnn update PRs (pip installs the updated PyPI ttnn package) and [dev] for regular PRs (uses the ttnn built from source)
- Direct URL dependencies force pip to replace local ttnn, ensuring new version is tested
update-ttnn-wheel.yaml: Updated to modify pyproject.toml [pypi] section, maintains auto-approve/merge gated on passing checks

7. Documentation and Error Messages

BuildFlow.md: Installation Quick Reference, dev vs PyPI workflows, CI behavior explanation
README.md: Updated installation instructions to show [pypi] extra is required
torch_ttnn/__init__.py: Fixed error message to show correct pip install torch-ttnn[pypi]
pyproject.toml: Added hint in description about [pypi] extra

Modernizes the C++ extension build system to follow Python packaging standards and enables PyPI distribution. The extension now works without manual environment variable setup.

Key improvements:

* Configure RPATH to use libraries from dependency packages
  - BUILD_RPATH includes absolute paths for build-time linking
  - INSTALL_RPATH: $ORIGIN:../torch/lib:../../ttnn/build/lib:../../ttnn
  - Extension uses TT-Metal libraries from ttnn package (no duplication)
  - Wheel size reduced to 768 KB (was 21 MB, 96% reduction)
  - Eliminates LD_LIBRARY_PATH requirement
* Move ttnn to optional [pypi] dependency group
  - Prevents pip from replacing locally built ttnn during development
  - Follows Python packaging standards (PEP 621 optional dependencies)
  - Dev builds: pip install -e .[dev] (uses local ttnn)
  - PyPI users: pip install torch-ttnn[pypi] (downloads ttnn)
* Force Clang-17 compiler for consistency
  - Set CMAKE_C_COMPILER and CMAKE_CXX_COMPILER in pyproject.toml
  - Matches tt-metal build compiler
  - Remove hardcoded ABI flags (now auto-detected from PyTorch)
* Fix tt-metal v0.64 compatibility
  - Update header path: tt-metalium/assert.hpp → tt_stl/assert.hpp
  - Add __FILE_NAME__ fallback macro for GCC 11 compatibility
  - Register extension as torch.ttnn module for PyTorch backend
* Update CI workflows for smart ttnn dependency testing
  - Remove LD_LIBRARY_PATH exports (RPATH handles this)
  - Add pyproject.toml to trigger paths in run-cpp-native-tests.yaml
  - Smart ttnn handling: detect ttnn update PRs by commit message, use [pypi,dev] to test new version, [dev] for regular builds
  - Direct URL dependencies force pip to replace local ttnn, ensuring automated updates are properly validated before auto-merge
  - Update ttnn-wheel-update workflow to modify pyproject.toml [pypi] section with direct URL format (was requirements.txt)
* Add comprehensive documentation and clear error messages
  - Installation quick reference for dev and PyPI users
  - Development vs PyPI distribution workflows
  - Dependency strategy and troubleshooting guide
  - CI behavior explanation: how ttnn updates are tested
  - Update README.md with [pypi] installation instructions
  - Fix torch_ttnn/__init__.py error message to show correct [pypi] extra
  - Add hint in pyproject.toml description

Tests pass without environment variables. Extension is self-contained via RPATH and ready for PyPI distribution.

@aliaksei-sala

Address review comment to prefer factory functions over constructors.

- Replace manual HostBuffer + Tensor constructor pattern with Tensor::from_borrowed_data() factory function
- Applied to both BFLOAT16 and UINT32 data type cases
- Eliminates intermediate HostBuffer variable
- Cleaner API that directly passes Span and MemoryPin to factory

Benefits:
- Follows tt-metal best practices for tensor construction
- More idiomatic API usage
- Simplified code with fewer intermediate steps

Addresses: @aliaksei-sala comment on copy.cpp

Issues fixed:
- PR#1243 Comment #4: .gitmodules Version Compatibility
- PR#1243 Comment #5: TT_METAL_REF override removal

Changes:

1. Updated .gitmodules branch from v0.58.0-rc25 to v0.62.0-dev20250916
   - Matches current ttnn wheel version (0.62.0.dev20250916)
   - Added documentation explaining automated update process
2. Fixed update-ttnn-wheel.yaml sed pattern bug
   - Changed s/.dev/-/ to s/\.dev/-dev/
   - Unescaped dot was matching any character, causing incorrect tag conversion
   - Example: 0.62.0.dev20250916 now correctly converts to v0.62.0-dev20250916
3. Removed TT_METAL_REF override from run-cpp-native-tests.yaml
   - Removed TT_METAL_REF: main env variable
   - Removed forced git checkout to specific branch
   - CI now respects .gitmodules branch configuration
   - Ensures automated ttnn update workflow changes are actually used

This allows the automated dependency update workflow to function correctly: when a new ttnn wheel is released, the workflow will update both pyproject.toml and .gitmodules, and CI will test with the matching versions.
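
The two substitution patterns from this commit can be compared with Python's `re` module, which treats an unescaped dot the same way sed does. This is a sketch only: the function names are hypothetical, and adding the leading "v" inside the function is an assumption about how the workflow builds the tag. For this particular input, the visible difference comes from the replacement text (the old one drops "dev"); escaping the dot additionally prevents the pattern from matching an arbitrary character before "dev".

```python
import re


def version_to_tag_old(version: str) -> str:
    """Old pattern s/.dev/-/: unescaped dot, and the replacement drops 'dev'."""
    return "v" + re.sub(r".dev", "-", version, count=1)


def version_to_tag_fixed(version: str) -> str:
    """Fixed pattern s/\\.dev/-dev/: literal dot, 'dev' kept in the tag."""
    return "v" + re.sub(r"\.dev", "-dev", version, count=1)


wheel_version = "0.62.0.dev20250916"
print(version_to_tag_old(wheel_version))    # v0.62.0-20250916
print(version_to_tag_fixed(wheel_version))  # v0.62.0-dev20250916
```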

Issue fixed:
- PR#1243 Comment #1: README.md installation instructions

Changes:
- Rewrote installation section with direct copy-paste commands
- Removed environment variables (TT_METAL_HOME)
- Added prerequisite installation step (pip install --upgrade pip scikit-build-core cmake ninja)
- Added numbered steps for clarity
- Added link to BuildFlow.md for detailed documentation
- Made instructions beginner-friendly and copy-paste ready

…ents-dev.txt

Issue fixed:
- PR#1243 Comment #2: requirements-dev.txt Removal - CI/CD Impact

Changes: Updated 3 action files to use pyproject.toml dependency specification:

1. build_cpp_extension_artifacts/action.yaml
   - Changed: pip install -r requirements-dev.txt → pip install -e .[dev]
   - Removed: redundant pip downgrade and numpy/setuptools installation
2. common_wheel_install/action.yaml
   - Changed cache dependency path from requirements files to pyproject.toml
   - Updated: pip install dist/torch_ttnn-*.whl + requirements-dev.txt → pip install dist/torch_ttnn-*.whl[dev]
   - Simplified installation logic
3. common_repo_setup/action.yaml
   - Changed: pip install -r requirements-dev.txt → pip install -e .[dev]
   - Enabled pip cache with pyproject.toml as cache-dependency-path
   - Removed commented requirements-dev.txt reference

All actions now follow modern Python packaging standards (PEP 517/621).

Issue fixed:
- PR#1243 Comment #3: TT_METAL_HOME Deprecation

Changes:
- Added auto-detection of TT_METAL_HOME from submodule path when env var not set
- Updated CMakeLists.txt to fall back to third-party/tt-metal if TT_METAL_HOME is unset
- Updated documentation to mark TT_METAL_HOME as optional (was: REQUIRED)
- Provides clear error message if neither the env var nor the submodule is available

Benefits:
- Users don't need to manually set TT_METAL_HOME for standard builds
- Eliminates conflicts when switching between TT projects (tt-train, etc.)
- Still supports TT_METAL_HOME override for advanced use cases
- CI workflows can keep using it explicitly for clarity, but it's not required

The CMake logic now:
1. If TT_METAL_HOME env var is set → use it
2. Else → auto-detect from third-party/tt-metal submodule
3. If neither → clear error message

This addresses aliaksei-sala's concern about TT_METAL_HOME being error-prone when switching between projects.
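
The three-step lookup described in this commit can be sketched in Python as follows. The actual logic lives in CMakeLists.txt and is not shown on this page; the function name `resolve_tt_metal_home`, its parameters, and the error text are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_tt_metal_home(
    repo_root: Path, env: Optional[Mapping[str, str]] = None
) -> Path:
    """Pick the tt-metal directory using the precedence described above."""
    env = os.environ if env is None else env

    # 1. An explicit TT_METAL_HOME override wins.
    override = env.get("TT_METAL_HOME")
    if override:
        return Path(override)

    # 2. Otherwise fall back to the third-party/tt-metal submodule.
    submodule = repo_root / "third-party" / "tt-metal"
    if submodule.is_dir():
        return submodule

    # 3. Neither is available: fail with a clear message.
    raise FileNotFoundError(
        "TT_METAL_HOME is not set and third-party/tt-metal was not found; "
        "set TT_METAL_HOME or initialize the submodule"
    )
```

Passing `env` explicitly keeps the function easy to test without modifying the real process environment.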

Issue fixed:
- PR#1243 Comment: Fresh venv without C++ extension support

Changes:
- Added SKIP_CPP_EXTENSION environment variable to CMakeLists.txt
- When set to 1, CMake skips C++ extension build entirely
- Allows pip install -e .[pypi,dev] in fresh venv without tt-metal/toolchain
- Provides clear status message when skipping

Usage:
    export SKIP_CPP_EXTENSION=1
    pip install -e .[pypi,dev]

Use cases:
- Installing just Python dependencies for development
- Testing Python code without C++ compilation
- CI jobs that don't need native integration
- Quick setup without full toolchain

This restores the previous capability of installing in pure Python mode that was available with requirements-dev.txt.

Added documentation for pure Python installation mode in README.md.

Users can now skip C++ extension build by setting SKIP_CPP_EXTENSION=1:
    export SKIP_CPP_EXTENSION=1
    pip install -e .[pypi,dev]

This is useful for:
- Installing Python dependencies only
- Testing Python code without C++ toolchain
- Quick setup without full build

Related to PR#1243 comment about supporting installation without C++ extension support.

Added verbose output to make C++ compilation logs visible:
- cmake.verbose = true (shows CMake configuration details)
- logging.level = "INFO" (shows build progress)

Benefits:
- C++ compilation output visible on both CI and local builds
- Better debugging when build issues occur
- Clear visibility of SKIP_CPP_EXTENSION when used
- CMake configuration details always shown

This makes the build process more transparent and easier to debug.

BFloat16's 7-bit mantissa limits precise integer representation to the range [-256, 256] for discrete uniform distributions. Using torch.randint(-1000, 1000, dtype=bfloat16) exceeds this range and triggers PyTorch warnings that will become hard errors in future releases:

"Due to precision limitations c10::BFloat16 can support discrete uniform distribution only within this range. This warning will become an error in version 1.7 release" (Triggered at pytorch/aten/src/ATen/native/DistributionTemplates.h:111-112)

Root Cause:
- BFloat16 format: [sign: 1 bit][exponent: 8 bits][mantissa: 7 bits]
- With 7 stored mantissa bits plus the implicit leading bit, BFloat16 has 8 bits of precision, so it represents every integer in [-256, 256] exactly
- Values outside [-256, 256] cannot all be represented exactly, violating the "uniform distribution" property

Solution: Apply dtype-conditional tensor creation (matching pattern already used in test_cpp_extension, lines 27-36):
- BFloat16: Use .uniform_() with safe range (-256, 256)
- Int types: Use torch.randint() with full range [-1000, 1000)

This follows tt-metal best practices:
- tt-metal/tests/ttnn/unit_tests/operations/eltwise/test_binary_ng_typecast.py uses torch.randint(low=-50, high=50, dtype=torch.bfloat16)
- tt-metal/tests/ttnn/unit_tests/tensor/test_tensor_ranks.py uses torch.randint(low=0, high=100).to(torch.bfloat16)
- tt-metal uses torch.randint(-1000, 1000) ONLY with torch.int32, never BFloat16

History:
- Bug introduced: Oct 2024 (commit 1099fde) when test was first created
- First fix: Nov 4, 2025 (commit 361fea6) - changed to [-255, 256)
- Regression: Nov 13, 2025 (commit 4d73e5e) - incorrectly reverted based on misunderstanding that "PyTorch 2.7.1 resolved BFloat16 limitations" (the limitation is intrinsic to BFloat16's data format, not a PyTorch bug)

Pre-existing issue in main branch since PyTorch 2.7.1 upgrade (Nov 9, 2025).

Fixes warnings in CI: https://github.com/tenstorrent/pytorch2.0_ttnn/actions/runs/19375082596/job/55440386669
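
The [-256, 256] bound can be checked without PyTorch by emulating BFloat16 rounding with the standard library. The sketch below assumes the usual definition of BFloat16 as the top 16 bits of an IEEE float32 with round-to-nearest-even; the helper names are hypothetical, and the code handles finite values only.

```python
import struct


def round_to_bfloat16(value: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # bfloat16 keeps the upper 16 bits of a float32; round away the lower 16.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]


def first_inexact_integer(limit: int = 2000) -> int:
    """Smallest non-negative integer below limit that bfloat16 cannot hold exactly."""
    for n in range(limit):
        if round_to_bfloat16(float(n)) != n:
            return n
    return -1


print(first_inexact_integer())    # 257
print(round_to_bfloat16(999.0))   # 1000.0
```

Every integer up to 256 survives the round trip, while 257 collapses to 256 and 999 to 1000, so integers drawn from [-1000, 1000) are no longer uniformly distributed once stored as BFloat16.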

Addresses jmalone-tt's question about secrets.GITHUB_TOKEN compatibility with tt-metal Docker container pulls.

Added container credentials:
- username: github.actor
- password: secrets.GITHUB_TOKEN (GitHub's default token)

This uses the standard GitHub-provided token, no custom tokens needed. Container pulls from ghcr.io/tenstorrent/tt-metal now authenticate properly.

Fixes: #1243 (comment)

… variable

Revised the installation instructions in README.md and BuildFlow.md to clarify the process for building TT-Metal as a git submodule. Emphasized that the build system automatically detects TT-Metal and actively ignores the TT_METAL_HOME environment variable to prevent conflicts. Updated related documentation to reflect these changes and ensure users follow the correct setup steps for development and installation.

- Simplified the submodule checkout process by removing redundant commands and ensuring a clean state before fetching.
- Implemented a super-clean checkout strategy to prevent stale reference errors when tt-metal is force-pushed.
- Updated environment variable handling and installation scripts for clarity and efficiency.
- Enhanced logging and error tolerance during submodule synchronization to improve CI robustness.

…mpatibility testing

- Implemented a temporary conditional include for the assert header to ensure compatibility with both current and future versions of tt-metal.
- This change allows for smoother transitions as the tt-metal version stabilizes.

… organization

- Grouped core runtime dependencies and added comments for better understanding.
- Moved data analysis and visualization libraries to a dedicated section in the dev dependencies.
- Ensured all dependencies are clearly categorized for easier maintenance.

Add comprehensive wheel testing to all CI workflows:
- Build wheel from submodule after tests pass
- Uninstall torch-ttnn AND ttnn to simulate clean environment
- Install wheel with [pypi] extra (gets ttnn from PyPI like users)
- Re-run tests to verify wheel works for end users

Changes:
- run-cpp-native-tests.yaml: Add wheel test after tests (every PR)
- build-test-release-wheel.yaml: Fix pipeline, use build_cpp_extension.sh
- before_merge.yaml: Add build-wheel-check as merge gate

Every commit now verified to produce working wheel for PyPI users.

Add wheel verification to CI and fix packaging issues for PyPI distribution.

Fixes:

1. Wheel packaging (pyproject.toml):
   - Exclude submodule from wheel
   - Exclude build artifacts and scripts
2. Bundle libtracy.so (CMakeLists.txt):
   - Workaround for ttnn PyPI wheel bug (v0.62.0-dev20250916)
   - libtt_metal.so depends on libtracy but ttnn wheel doesn't include it
   - Bundle libtracy*.so* to avoid runtime errors

shutovilyaep · 2025-11-19T19:23:29Z

Created a branch in the "tenstorrent" repository with the same content as #1243.
@kevinwuTT asked me to manually run https://github.com/tenstorrent/pytorch2.0_ttnn/actions/workflows/run-tests.yaml

I selected "Commit generated report files: None"; @kevinwuTT, please confirm whether I should choose a different option there.
I see it failed; taking a look: https://github.com/tenstorrent/pytorch2.0_ttnn/actions/runs/19513628352

shutovilyaep · 2025-11-19T21:50:43Z

run-tests.yaml run after the CI script fixes:
https://github.com/tenstorrent/pytorch2.0_ttnn/actions/runs/19517439120/job/55872712360
(failed because of something sfpi-related; TODO: take a look)
Done (README.md.in): updated the docs about using pip-tools to install Python-only dependencies without building via scikit-build-core

shutovilyaep · 2025-11-20T14:17:54Z

"The algorithm should be correct, and then be optimized"

Observations after some time spent working with GitHub Actions:

Correctness: [as done in this PR] for complete testing and checking, especially when there are many changes, it is a good idea to verify everything on any change to C++ code or to pyproject.toml configuration/dependencies
[C++ build, Python development build from source, tests against the development build, wheel creation, tests against the installation from the built wheel]
Optimization: such a long pipeline will overuse TT hardware if it is enabled as a CI job that runs on every PR change

shutovilyaep added 30 commits November 19, 2025 08:56

PR fix: test: simplify tensor creation with PyTorch 2.7.1 randint improvements

8fb72fb

PyTorch 2.7.1 resolved BFloat16 discrete uniform distribution limitations, allowing direct use of torch.randint() with all dtypes including bfloat16.

PR fix: clang-format workaround removed

49ae48d

PR fix: ttnn_device_mode.py docs

550a5c0

PR check: removed retry on initial git submodules sync

9b9c1ec

PR fix: revert to using base Docker image ghcr.io/tenstorrent/pytorch2.0_ttnn

30551b5

Revert "PR fix: revert to using base Docker image ghcr.io/tenstorrent/pytorch2.0_ttnn"

d57d8d8

This reverts commit bffad73.

fix: update-docker-container.yaml fix - using pyproject.toml, not requirements.txt

6b8cfdb

comments update

1c030df

fix: update header path for tt-metal v0.62.0-dev20250916

55680ec

Changed: <tt_stl/assert.hpp> → <tt-metalium/assert.hpp> The header path changed between tt-metal versions. The v0.62.0-dev20250916 uses tt-metalium/assert.hpp instead of tt_stl/assert.hpp.

fix: CI fetch robustness

a8462a7

Example of CI failed due to fetch problems: https://github.com/tenstorrent/pytorch2.0_ttnn/actions/runs/19463660350/job/55693557194?pr=1243

CI fix attempt

5aa4866

docs: updated deprecated docs related to TT_METAL_HOME and binaries bundling

9fa4ee1

fix:

a05a31e

- Added workaround for broken auto-detection in tt-metal source builds.

fix: enhance C++ extension build and test workflow, docs are updated

420fb2a

- build_cpp_extension.sh, run_cpp_extension_tests.sh are created with Release, Debug modes

quickfix: llk-related updates do not allow to cleanly perform checkout when repo is created, perform a clean checkout via GitHub actions

77ed6b2

shutovilyaep added 4 commits November 19, 2025 08:56

fix: bundling binaries to torch-ttnn to fix wheel for end-users, documentation with reasoning is created

4152a28

(build) CMake files cleanup

09fac67

shutovilyaep mentioned this pull request Nov 19, 2025

[initial] [forked repository -> tenstorrent] tt-metal main branch compatibility fixes #1243

Closed

shutovilyaep requested review from ayerofieiev-tt, jmalone-tt and kevinwuTT November 19, 2025 19:23

shutovilyaep force-pushed the fix/tt_metal_bump branch 3 times, most recently from 813bc69 to fb5aa74 Compare November 19, 2025 21:44

shutovilyaep force-pushed the fix/tt_metal_bump branch 7 times, most recently from cd3f9ad to cd134f4 Compare November 20, 2025 10:48

(ci) fix: GitHub Actions workflows updates

29d7278

- installing git-lfs - accurate checkout not to present tt-metal's submodule .github folders to GitHub containing not available actions for all runners

shutovilyaep force-pushed the fix/tt_metal_bump branch from faa93aa to 29d7278 Compare November 20, 2025 13:32

shutovilyaep added 3 commits November 20, 2025 13:34

(ci) Installing tt-metal dependencies including sfpi

b3b0a91

(ci) install_full_dependencies checkbox added

a29b373

(ci) possible error fix

8e87dd2

shutovilyaep force-pushed the fix/tt_metal_bump branch from 3baf366 to 8e87dd2 Compare November 20, 2025 14:04

shutovilyaep added 2 commits November 20, 2025 20:01

(docs) Make it clear that TT_METAL_HOME is ignored during build, but MUST be set when running tests

dd09d71

(ci) setting TT_METAL_HOME env variable before running tests

254f264

shutovilyaep force-pushed the fix/tt_metal_bump branch from 6f5cbe9 to 254f264 Compare November 20, 2025 20:01
