[CI] Refactor CI for better caching #294

Draft
brnorris03 wants to merge 21 commits into main from bnorris/ci-refactor

@brnorris03 brnorris03 commented Jan 29, 2026

Problem: CI rebuilds the same tt-mlir revision unnecessarily. As a result, Docker container builds time out and some regular CI jobs take longer than they should. Caching is also inconsistent.

Solution:

  • Add call-build-ttmlir-toolchain.yml as dedicated cache builder
  • Simplify CI workflow to restore cache with fail-on-cache-miss
  • Simplify container workflow to use same unified cache
  • Update Docker build to require pre-built toolchain via --ttmlir-toolchain
  • Update Dockerfile to use toolchain from build context
  • Make ccache caching actually work by keying the cache per ref rather than per commit
  • Add CI_WORKFLOWS.md documentation; update BUILD_SYSTEM.md and containers/README.md for the new architecture
  • Refactor common functionality into helper scripts
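
To illustrate the per-ref versus per-commit point for ccache, here is a minimal sketch. The `ccache_key` helper, the `ccache` key prefix, and the slash replacement are assumptions for illustration; the actual key used by the workflows is not shown in this description.

```shell
# Hypothetical helper (not from the PR): builds a ccache cache key from the
# runner OS and a "scope" value, which is either a commit SHA or a git ref.
ccache_key() {
    # $1 = runner OS, $2 = scope value
    # Slashes in refs are replaced with dashes (an assumption made here so
    # the key stays a single token).
    printf '%s-ccache-%s\n' "$1" "$(printf '%s' "$2" | tr '/' '-')"
}

# Two pushes to the same branch produce two different per-commit keys,
# so the second run cannot find what the first run saved.
ccache_key Linux 9e20814
ccache_key Linux ce9c88a

# A per-ref key is identical for both pushes, so a later run on the same
# branch can find the compiler cache an earlier run saved.
ccache_key Linux refs/heads/bnorris/ci-refactor
```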

New on-pr workflow:

  on-pr.yml (triggers on PR)
      │
      ├── pre-commit (parallel)
      │
      ├── toolchain (check-cache → build if needed)
      │       │
      │       └── build (waits for toolchain via needs:)
      │
      ├── test-sim (parallel)
      │
      └── check-all-green (waits for all)
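
The final gate in the diagram can be sketched as a small shell check. This is a hypothetical illustration of a "waits for all" job, not the PR's implementation: it assumes the job receives each upstream result as a string using the values GitHub Actions reports (`success`, `failure`, `cancelled`, `skipped`) and treats anything other than `success` as a failure.

```shell
# Hypothetical gate (not from the PR): succeeds only if every upstream job
# result passed as an argument is exactly "success".
check_all_green() {
    for result in "$@"; do
        if [ "$result" != "success" ]; then
            # Report the first non-success result on stderr and fail.
            echo "upstream job result: $result" >&2
            return 1
        fi
    done
    return 0
}

# Example: pre-commit, toolchain, build, and test-sim all succeeded.
check_all_green success success success success && echo "all green"
```

Whether a skipped job should count as green is a design choice; this sketch takes the strict option.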

This follows a GitHub Actions best practice: a dedicated workflow builds and caches dependencies, and other workflows consume the cache with fail-on-cache-miss.

When the toolchain cache needs rebuilding, the regular on-pr workflow fails with a clear error message and instructions, e.g.:

Error: Toolchain cache not found for tt-mlir commit dd15572dc0b92167ea1b186161ea4f74107b0329
============================================================
  TOOLCHAIN CACHE NOT FOUND
============================================================
The LLVM + tt-mlir toolchain cache does not exist for this commit.
To fix this, run the toolchain build workflow:
  1. Go to: https://github.com/tenstorrent/tt-lang/actions/workflows/call-build-ttmlir-toolchain.yml
  2. Click 'Run workflow' button
  3. Click the green 'Run workflow' button in the dropdown
  4. Wait for completion (~3-4 hours for full LLVM build)
  5. Re-run this workflow
Cache details:
  Key: Linux-ttlang-toolchain-v1-dd15572dc0b92167ea1b186161ea4f74107b0329
  tt-mlir commit: dd15572dc0b92167ea1b186161ea4f74107b0329
See .github/CI_WORKFLOWS.md for more information.
============================================================
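
For reference, the key in the message above has the shape `<OS>-ttlang-toolchain-v1-<tt-mlir commit>`. The sketch below shows one way such a key and a failure report could be produced. The function names are hypothetical and the report is abbreviated; only the key format and the `::error::` workflow command are taken as given.

```shell
# Builds a key in the format shown in the message above:
#   <OS>-ttlang-toolchain-v1-<tt-mlir commit>
toolchain_cache_key() {
    # $1 = runner OS, $2 = tt-mlir commit SHA
    printf '%s-ttlang-toolchain-v1-%s\n' "$1" "$2"
}

# Hypothetical reporter (not from the PR): prints a GitHub Actions error
# annotation and the cache details, then returns non-zero so the calling
# step fails.
report_cache_miss() {
    # $1 = runner OS, $2 = tt-mlir commit SHA
    echo "::error::Toolchain cache not found for tt-mlir commit $2"
    echo "Cache details:"
    echo "  Key: $(toolchain_cache_key "$1" "$2")"
    echo "  tt-mlir commit: $2"
    return 1
}
```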

Checklist:

  • Self-reviewed (style, logic)
  • Added tests (or justified none needed)
  • PR is small and focused (one task)

@brnorris03 brnorris03 changed the title Refactor CI to use dedicated toolchain cache workflow [CI] Refactor CI to use dedicated toolchain cache workflow Jan 29, 2026
@brnorris03 brnorris03 changed the title [CI] Refactor CI to use dedicated toolchain cache workflow [CI] Refactor CI for better caching Jan 29, 2026
remove dependence on tt-mlir container
Updated container docs with new image names:
  - tt-lang-dev-ubuntu-22-04 - toolchain + dev tools (for developers)
  - tt-lang-user-ubuntu-22-04 - dev + tt-lang (for end users)
  - Split toolchain workflow into check-cache (ubuntu-latest) and build (large runner)
  - Change schedule from weekly to nightly
  - Remove redundant push trigger, add pull_request trigger for tt-mlir.commit changes
  - Fix Docker workflow to use determine-ttmlir-commit.sh and honor mlir_override input
  - Ensure consistent cache key format across all workflows
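
As a rough picture of what "honor mlir_override" could mean, the sketch below resolves the tt-mlir commit from an override when one is given and otherwise from a pinned-commit file. The contents of determine-ttmlir-commit.sh are not shown in this PR, so the function name, its arguments, and the file layout (commit SHA as the first token of the first line) are all assumptions.

```shell
# Hypothetical resolver (the real determine-ttmlir-commit.sh is not shown here).
resolve_ttmlir_commit() {
    # $1 = override value (may be empty)
    # $2 = path to a file holding the pinned commit (assumed layout)
    if [ -n "$1" ]; then
        # An explicit override wins over the pinned commit.
        printf '%s\n' "$1"
    elif [ -r "$2" ]; then
        # Take the first whitespace-delimited token on the first line.
        awk 'NR == 1 { print $1; exit }' "$2"
    else
        echo "no override given and cannot read $2" >&2
        return 1
    fi
}
```

Feeding every workflow's cache key from one resolver like this is one way to keep the key format consistent.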
@brnorris03 brnorris03 force-pushed the bnorris/ci-refactor branch 2 times, most recently from 9e20814 to ce9c88a Compare January 30, 2026 02:11
@brnorris03 brnorris03 force-pushed the bnorris/ci-refactor branch 3 times, most recently from d7a6632 to 83896ec Compare January 30, 2026 03:58
@brnorris03 brnorris03 force-pushed the bnorris/ci-refactor branch 11 times, most recently from fdf4b0e to e3e964b Compare January 30, 2026 05:09
@brnorris03 brnorris03 force-pushed the bnorris/ci-refactor branch 3 times, most recently from 2670e35 to eb50c13 Compare January 30, 2026 14:57