tt-metal main branch compatibility fixes #1293
Open
shutovilyaep wants to merge 41 commits into main from fix/tt_metal_bump
Modernizes the C++ extension build system to follow Python packaging
standards and enables PyPI distribution. The extension now works
without manual environment variable setup.
Key improvements:
* Configure RPATH to use libraries from dependency packages
- BUILD_RPATH includes absolute paths for build-time linking
- INSTALL_RPATH: $ORIGIN:../torch/lib:../../ttnn/build/lib:../../ttnn
- Extension uses TT-Metal libraries from ttnn package (no duplication)
- Wheel size reduced to 768 KB (was 21 MB, 96% reduction)
- Eliminates LD_LIBRARY_PATH requirement
* Move ttnn to optional [pypi] dependency group
- Prevents pip from replacing locally built ttnn during development
- Follows Python packaging standards (PEP 621 optional dependencies)
- Dev builds: pip install -e .[dev] (uses local ttnn)
- PyPI users: pip install torch-ttnn[pypi] (downloads ttnn)
* Force Clang-17 compiler for consistency
- Set CMAKE_C_COMPILER and CMAKE_CXX_COMPILER in pyproject.toml
- Matches tt-metal build compiler
- Remove hardcoded ABI flags (now auto-detected from PyTorch)
* Fix tt-metal v0.64 compatibility
- Update header path: tt-metalium/assert.hpp → tt_stl/assert.hpp
- Add __FILE_NAME__ fallback macro for GCC 11 compatibility
- Register extension as torch.ttnn module for PyTorch backend
* Update CI workflows for smart ttnn dependency testing
- Remove LD_LIBRARY_PATH exports (RPATH handles this)
- Add pyproject.toml to trigger paths in run-cpp-native-tests.yaml
- Smart ttnn handling: detect ttnn update PRs by commit message,
use [pypi,dev] to test new version, [dev] for regular builds
- Direct URL dependencies force pip to replace local ttnn, ensuring
automated updates are properly validated before auto-merge
- Update ttnn-wheel-update workflow to modify pyproject.toml [pypi]
section with direct URL format (was requirements.txt)
* Add comprehensive documentation and clear error messages
- Installation quick reference for dev and PyPI users
- Development vs PyPI distribution workflows
- Dependency strategy and troubleshooting guide
- CI behavior explanation: how ttnn updates are tested
- Update README.md with [pypi] installation instructions
- Fix torch_ttnn/__init__.py error message to show correct [pypi] extra
- Add hint in pyproject.toml description
Tests pass without environment variables. Extension is self-contained
via RPATH and ready for PyPI distribution.
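The RPATH entries above rely on the dynamic loader replacing $ORIGIN with the directory that contains the loaded extension module. As a rough illustration only (this is a Python model of that substitution, not the loader's implementation; the module path and the RPATH string used in the usage line are hypothetical), the expansion can be sketched as:

```python
import posixpath


def expand_rpath(rpath: str, module_path: str) -> list[str]:
    """Model how $ORIGIN entries in an RPATH string map to directories.

    `module_path` is the (hypothetical) absolute path of the shared object
    that carries the RPATH; $ORIGIN stands for its containing directory.
    """
    origin = posixpath.dirname(module_path)
    directories = []
    for entry in rpath.split(":"):
        if not entry:
            continue  # ignore empty entries in this sketch
        expanded = entry.replace("${ORIGIN}", origin).replace("$ORIGIN", origin)
        directories.append(posixpath.normpath(expanded))
    return directories


if __name__ == "__main__":
    # Hypothetical install location, chosen only for illustration.
    module = "/venv/site-packages/torch_ttnn_cpp_extension/ext.so"
    for directory in expand_rpath("$ORIGIN:$ORIGIN/../torch/lib", module):
        print(directory)
```

Because every entry is anchored at the module's own directory, the search path in this model does not depend on the working directory or on LD_LIBRARY_PATH.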
Address review comment to prefer factory functions over constructors.
- Replace manual HostBuffer + Tensor constructor pattern with Tensor::from_borrowed_data() factory function
- Applied to both BFLOAT16 and UINT32 data type cases
- Eliminates intermediate HostBuffer variable
- Cleaner API that directly passes Span and MemoryPin to factory
Benefits:
- Follows tt-metal best practices for tensor construction
- More idiomatic API usage
- Simplified code with fewer intermediate steps
Addresses: @aliaksei-sala comment on copy.cpp
…rovements PyTorch 2.7.1 resolved BFloat16 discrete uniform distribution limitations, allowing direct use of torch.randint() with all dtypes including bfloat16.
…/pytorch2.0_ttnn" This reverts commit bffad73.
Issues fixed:
- PR#1243 Comment #4: .gitmodules Version Compatibility
- PR#1243 Comment #5: TT_METAL_REF override removal
Changes:
1. Updated .gitmodules branch from v0.58.0-rc25 to v0.62.0-dev20250916
   - Matches current ttnn wheel version (0.62.0.dev20250916)
   - Added documentation explaining automated update process
2. Fixed update-ttnn-wheel.yaml sed pattern bug
   - Changed s/.dev/-/ to s/\.dev/-dev/
   - Unescaped dot was matching any character, causing incorrect tag conversion
   - Example: 0.62.0.dev20250916 now correctly converts to v0.62.0-dev20250916
3. Removed TT_METAL_REF override from run-cpp-native-tests.yaml
   - Removed TT_METAL_REF: main env variable
   - Removed forced git checkout to specific branch
   - CI now respects .gitmodules branch configuration
   - Ensures automated ttnn update workflow changes are actually used
This allows the automated dependency update workflow to function correctly: when a new ttnn wheel is released, the workflow will update both pyproject.toml and .gitmodules, and CI will test with the matching versions.
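The version-to-tag conversion described in change 2 can be mirrored in Python for illustration. The workflow itself uses sed; the function names here are hypothetical, and how the workflow adds the leading "v" is an assumption made to match the example in the commit message.

```python
import re


def wheel_version_to_tag(version: str) -> str:
    """Convert a ttnn wheel version to a tt-metal tag (fixed pattern).

    Mirrors s/\\.dev/-dev/: the dot is escaped, so only a literal ".dev"
    matches. Prepending "v" is an assumption about the surrounding workflow.
    """
    return "v" + re.sub(r"\.dev", "-dev", version, count=1)


def wheel_version_to_tag_buggy(version: str) -> str:
    """Mirror of the old s/.dev/-/ pattern, kept only for comparison.

    The unescaped dot matches any character, and the replacement drops
    the "dev" part entirely.
    """
    return "v" + re.sub(r".dev", "-", version, count=1)


if __name__ == "__main__":
    version = "0.62.0.dev20250916"
    print(wheel_version_to_tag(version))
    print(wheel_version_to_tag_buggy(version))
```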
Issue fixed:
- PR#1243 Comment #1: README.md installation instructions
Changes:
- Rewrote installation section with direct copy-paste commands
- Removed environment variables (TT_METAL_HOME)
- Added prerequisite installation step (pip install --upgrade pip scikit-build-core cmake ninja)
- Added numbered steps for clarity
- Added link to BuildFlow.md for detailed documentation
- Made instructions beginner-friendly and copy-paste ready
…ents-dev.txt
Issue fixed:
- PR#1243 Comment #2: requirements-dev.txt Removal - CI/CD Impact
Changes: Updated 3 action files to use pyproject.toml dependency specification:
1. build_cpp_extension_artifacts/action.yaml
   - Changed: pip install -r requirements-dev.txt → pip install -e .[dev]
   - Removed: redundant pip downgrade and numpy/setuptools installation
2. common_wheel_install/action.yaml
   - Changed cache dependency path from requirements files to pyproject.toml
   - Updated: pip install dist/torch_ttnn-*.whl + requirements-dev.txt → pip install dist/torch_ttnn-*.whl[dev]
   - Simplified installation logic
3. common_repo_setup/action.yaml
   - Changed: pip install -r requirements-dev.txt → pip install -e .[dev]
   - Enabled pip cache with pyproject.toml as cache-dependency-path
   - Removed commented requirements-dev.txt reference
All actions now follow modern Python packaging standards (PEP 517/621).
Changed: <tt_stl/assert.hpp> → <tt-metalium/assert.hpp> The header path changed between tt-metal versions. The v0.62.0-dev20250916 uses tt-metalium/assert.hpp instead of tt_stl/assert.hpp.
Issue fixed:
- PR#1243 Comment #3: TT_METAL_HOME Deprecation
Changes:
- Added auto-detection of TT_METAL_HOME from submodule path when env var not set
- Updated CMakeLists.txt to fall back to third-party/tt-metal if TT_METAL_HOME is unset
- Updated documentation to mark TT_METAL_HOME as optional (was: REQUIRED)
- Provides clear error message if neither env var nor submodule is available
Benefits:
- Users don't need to manually set TT_METAL_HOME for standard builds
- Eliminates conflicts when switching between TT projects (tt-train, etc.)
- Still supports TT_METAL_HOME override for advanced use cases
- CI workflows can keep using it explicitly for clarity, but it's not required
The CMake logic now:
1. If TT_METAL_HOME env var is set → use it
2. Else → auto-detect from third-party/tt-metal submodule
3. If neither → clear error message
This addresses aliaksei-sala's concern about TT_METAL_HOME being error-prone when switching between projects.
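The three-step resolution order above lives in CMakeLists.txt. The following Python sketch only restates that order for illustration. The function name, the `repo_root` parameter, and the "submodule directory exists and is non-empty" check are assumptions, not the project's actual implementation.

```python
import os
from pathlib import Path


def resolve_tt_metal_home(repo_root, env=None) -> Path:
    """Restate the described lookup order: env var, then submodule, else error."""
    env = os.environ if env is None else env

    # 1. An explicitly set TT_METAL_HOME wins.
    override = env.get("TT_METAL_HOME")
    if override:
        return Path(override)

    # 2. Otherwise fall back to the third-party/tt-metal submodule.
    #    Assumption: a populated submodule is a non-empty directory.
    submodule = Path(repo_root) / "third-party" / "tt-metal"
    if submodule.is_dir() and any(submodule.iterdir()):
        return submodule

    # 3. Neither is available: fail with an actionable message.
    raise FileNotFoundError(
        "TT_METAL_HOME is not set and third-party/tt-metal is missing or empty; "
        "set TT_METAL_HOME or initialize the submodule."
    )
```

Passing `env` explicitly keeps the sketch easy to exercise without touching the real process environment.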
Issue fixed:
- PR#1243 Comment: Fresh venv without C++ extension support
Changes:
- Added SKIP_CPP_EXTENSION environment variable to CMakeLists.txt
- When set to 1, CMake skips C++ extension build entirely
- Allows pip install -e .[pypi,dev] in fresh venv without tt-metal/toolchain
- Provides clear status message when skipping
Usage:
export SKIP_CPP_EXTENSION=1
pip install -e .[pypi,dev]
Use cases:
- Installing just Python dependencies for development
- Testing Python code without C++ compilation
- CI jobs that don't need native integration
- Quick setup without full toolchain
This restores the previous capability of installing in pure Python mode that was available with requirements-dev.txt.
Added documentation for pure Python installation mode in README.md. Users can now skip C++ extension build by setting SKIP_CPP_EXTENSION=1:
export SKIP_CPP_EXTENSION=1
pip install -e .[pypi,dev]
This is useful for:
- Installing Python dependencies only
- Testing Python code without C++ toolchain
- Quick setup without full build
Related to PR#1243 comment about supporting installation without C++ extension support.
Added verbose output to make C++ compilation logs visible:
- cmake.verbose = true (shows CMake configuration details)
- logging.level = "INFO" (shows build progress)
Benefits:
- C++ compilation output visible on both CI and local builds
- Better debugging when build issues occur
- Clear visibility of SKIP_CPP_EXTENSION when used
- CMake configuration details always shown
This makes the build process more transparent and easier to debug.
BFloat16's 7-bit mantissa limits precise integer representation to the range [-256, 256] for discrete uniform distributions. Using torch.randint(-1000, 1000, dtype=bfloat16) exceeds this range and triggers PyTorch warnings that will become hard errors in future releases:
"Due to precision limitations c10::BFloat16 can support discrete uniform distribution only within this range. This warning will become an error in version 1.7 release" (Triggered at pytorch/aten/src/ATen/native/DistributionTemplates.h:111-112)
Root Cause:
- BFloat16 format: [sign: 1 bit][exponent: 8 bits][mantissa: 7 bits]
- With 7 stored mantissa bits plus the implicit leading bit, BFloat16 has 8 bits of precision, so only integers with magnitude up to 2^8 = 256 are all exactly representable
- Values outside [-256, 256] cannot all be represented exactly, violating the "uniform distribution" property
Solution: Apply dtype-conditional tensor creation (matching pattern already used in test_cpp_extension, lines 27-36):
- BFloat16: Use .uniform_() with safe range (-256, 256)
- Int types: Use torch.randint() with full range [-1000, 1000)
This follows tt-metal best practices:
- tt-metal/tests/ttnn/unit_tests/operations/eltwise/test_binary_ng_typecast.py uses torch.randint(low=-50, high=50, dtype=torch.bfloat16)
- tt-metal/tests/ttnn/unit_tests/tensor/test_tensor_ranks.py uses torch.randint(low=0, high=100).to(torch.bfloat16)
- tt-metal uses torch.randint(-1000, 1000) ONLY with torch.int32, never BFloat16
History:
- Bug introduced: Oct 2024 (commit 1099fde) when test was first created
- First fix: Nov 4, 2025 (commit 361fea6) - changed to [-255, 256)
- Regression: Nov 13, 2025 (commit 4d73e5e) - incorrectly reverted based on misunderstanding that "PyTorch 2.7.1 resolved BFloat16 limitations" (the limitation is intrinsic to BFloat16's data format, not a PyTorch bug)
Pre-existing issue in main branch since PyTorch 2.7.1 upgrade (Nov 9, 2025).
Fixes warnings in CI: https://github.com/tenstorrent/pytorch2.0_ttnn/actions/runs/19375082596/job/55440386669
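The [-256, 256] bound can be checked without PyTorch. The sketch below emulates float32 to BFloat16 conversion in pure Python by keeping the top 16 bits with round-to-nearest-even. The helper names are hypothetical, and the emulation assumes finite inputs (NaN and infinity handling is omitted).

```python
import struct


def to_bfloat16(value: float) -> float:
    """Round a finite float to the nearest BFloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Round-to-nearest-even on the 16 low bits that BFloat16 discards.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def first_inexact_integer(limit: int = 1000) -> int:
    """Return the smallest positive integer that BFloat16 cannot hold exactly."""
    for n in range(limit + 1):
        if to_bfloat16(float(n)) != float(n):
            return n
    raise ValueError("no inexact integer found below the limit")


if __name__ == "__main__":
    n = first_inexact_integer()
    print(n, "is stored as", to_bfloat16(float(n)))
```

Every integer in [-256, 256] survives the round trip in this emulation, while larger integers start collapsing onto their neighbours, which is why a discrete uniform draw over [-1000, 1000) cannot stay uniform in BFloat16.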
Addresses jmalone-tt's question about secrets.GITHUB_TOKEN compatibility with tt-metal Docker container pulls. Added container credentials: - username: github.actor - password: secrets.GITHUB_TOKEN (GitHub's default token) This uses the standard GitHub-provided token, no custom tokens needed. Container pulls from ghcr.io/tenstorrent/tt-metal now authenticate properly. Fixes: #1243 (comment)
… variable Revised the installation instructions in README.md and BuildFlow.md to clarify the process for building TT-Metal as a git submodule. Emphasized that the build system automatically detects TT-Metal and actively ignores the TT_METAL_HOME environment variable to prevent conflicts. Updated related documentation to reflect these changes and ensure users follow the correct setup steps for development and installation.
Example of CI failed due to fetch problems: https://github.com/tenstorrent/pytorch2.0_ttnn/actions/runs/19463660350/job/55693557194?pr=1243
- Simplified the submodule checkout process by removing redundant commands and ensuring a clean state before fetching.
- Implemented a super-clean checkout strategy to prevent stale reference errors when tt-metal is force-pushed.
- Updated environment variable handling and installation scripts for clarity and efficiency.
- Enhanced logging and error tolerance during submodule synchronization to improve CI robustness.
…mpatibility testing
- Implemented a temporary conditional include for the assert header to ensure compatibility with both current and future versions of tt-metal.
- This change allows for smoother transitions as the tt-metal version stabilizes.
- Added build_cpp_extension.sh and run_cpp_extension_tests.sh, both supporting Release and Debug modes
…t when repo is created, perform a clean checkout via GitHub actions
… organization
- Grouped core runtime dependencies and added comments for better understanding.
- Moved data analysis and visualization libraries to a dedicated section in the dev dependencies.
- Ensured all dependencies are clearly categorized for easier maintenance.
Add comprehensive wheel testing to all CI workflows:
- Build wheel from submodule after tests pass
- Uninstall torch-ttnn AND ttnn to simulate clean environment
- Install wheel with [pypi] extra (gets ttnn from PyPI like users)
- Re-run tests to verify wheel works for end users
Changes:
- run-cpp-native-tests.yaml: Add wheel test after tests (every PR)
- build-test-release-wheel.yaml: Fix pipeline, use build_cpp_extension.sh
- before_merge.yaml: Add build-wheel-check as merge gate
Every commit is now verified to produce a working wheel for PyPI users.
Add wheel verification to CI and fix packaging issues for PyPI distribution.
Fixes:
1. Wheel packaging (pyproject.toml):
   - Exclude submodule from wheel
   - Exclude build artifacts and scripts
2. Bundle libtracy.so (CMakeLists.txt):
   - Workaround for ttnn PyPI wheel bug (v0.62.0-dev20250916)
   - libtt_metal.so depends on libtracy but ttnn wheel doesn't include it
   - Bundle libtracy*.so* to avoid runtime errors
…mentation with reasoning is created
Author: Created a branch in the "tenstorrent" repository with the same content as #1243
813bc69 to fb5aa74
cd3f9ad to cd134f4
- Install git-lfs
- Careful checkout so that .github folders from the tt-metal submodule, which contain actions not available on all runners, are not presented to GitHub
faa93aa to 29d7278
3baf366 to 8e87dd2
Author: "The algorithm should be correct, and then be optimized" Observations after some time of dealing with GitHub actions:
6f5cbe9 to 254f264
Build System Modernization & PyPI Readiness
Summary
Modernizes the build system to eliminate manual environment setup and enable PyPI distribution. The extension now works without LD_LIBRARY_PATH or other environment variables. Note: running tests against development builds requires setting the TT_METAL_HOME environment variable due to a bug in current tt-metal Python path discovery for source builds (not needed for PyPI wheel installations where runtime assets are bundled). CI validates wheel creation and tests installed wheels to emulate PyPI user experience. This PR fixes compatibility with tt-metal's current main branch (Nov 2025) and addresses project freeze requirements to prolong project building without manual code changes.
Key Changes
1. RPATH Configuration and Binary Bundling
Problem: Extension couldn't find PyTorch/TT-Metal libraries at runtime.
Fix: Configure RPATH and bundle ttnn binaries via CMake during wheel build.
- BUILD_RPATH: $ORIGIN:${PYTORCH_LIB_DIR}:${TT_METAL_LIB_DIR}
- INSTALL_RPATH: $ORIGIN:$ORIGIN/../torch/lib
- torch_ttnn_cpp_extension/ directory alongside the extension module
- $ORIGIN allows extension to find bundled libraries in the same directory
Result: No LD_LIBRARY_PATH needed. Bundling simplifies PyPI wheel creation by including required binaries directly in the wheel.
2. Dependency Management
Problem: pip install -e .[dev] replaced locally built ttnn with old PyPI version, causing API mismatches.
Fix: Moved ttnn to optional [pypi] extra (Python packaging standard).
Usage:
- pip install -e .[dev] (uses local ttnn)
- pip install torch-ttnn[pypi] (downloads ttnn)
3. CI Wheel Testing (PyPI User Simulation)
CI now validates wheel creation and tests installed wheels to ensure PyPI compatibility.
Process:
- [pypi] extra (downloads ttnn from PyPI)
Implementation:
- build-test-release-wheel.yaml: Builds wheel, uninstalls dev installation, installs wheel with [pypi] extra, verifies installation
- run-cpp-native-tests.yaml: Runs tests against development build, then builds wheel, uninstalls dev packages, and runs tests against PyPI-emulated installation
- before_merge.yaml: Added build-wheel-check job that verifies wheel builds successfully before merge
- test-wheel-smoke, test-wheel-lowering, test-wheel-model run comprehensive tests against installed wheels
This ensures wheels work correctly for PyPI users before release.
4. Consistent Compiler
Problem: scikit-build-core used GCC 11 instead of Clang-17.
Fix: Forced Clang-17 in pyproject.toml, removed hardcoded ABI flags (now auto-detected).
5. Source Code Updates
Code changes required due to tt-metal API changes and PyTorch 2.7.1 bump:
- __FILE_NAME__ fallback for GCC 11 compatibility
- torch.ttnn (fixes import torch.ttnn due to PyTorch 2.7.1 change)
6. CI Infrastructure Updates
- LD_LIBRARY_PATH exports
- pyproject.toml to trigger paths (enables CI on ttnn updates)
- [pypi,dev] for ttnn update PRs (builds ttnn from source), [dev] for regular PRs (uses updated PyPI ttnn package)
- pyproject.toml [pypi] section, maintains auto-approve/merge gated on passing checks
7. Documentation and Error Messages
- [pypi] extra is required
- pip install torch-ttnn[pypi]
- [pypi] extra
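An import-time check that points users at the [pypi] extra could look roughly like the sketch below. This is a guess at the shape of such a check, not the code in torch_ttnn/__init__.py. The function name and the message wording are hypothetical; only the pip install torch-ttnn[pypi] command comes from the PR text.

```python
import importlib.util


def require_optional(module_name: str = "ttnn",
                     package: str = "torch-ttnn",
                     extra: str = "pypi") -> None:
    """Raise a descriptive ImportError if an optional dependency is missing.

    Uses find_spec so the check does not import the module itself.
    Intended for top-level module names only.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is not installed. Use a locally built "
            f"{module_name}, or install it with: pip install {package}[{extra}]"
        )


if __name__ == "__main__":
    try:
        require_optional()
        print("ttnn is available")
    except ImportError as exc:
        print(exc)
```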