Skip to content

Conversation

@distributedstatemachine
Copy link
Member

@distributedstatemachine distributedstatemachine commented Sep 12, 2025

Summary by CodeRabbit

  • New Features

    • Docker-based validator deployment with GPU support, Watchtower auto-updates, and a publishable Docker image workflow.
    • New emission-burn environment options to configure automated burn behavior.
    • Expanded public utilities (Hugging Face helpers, drand beacon utilities, Prover/Verifier exposure).
  • Documentation

    • Added Validator Deployment guide and Docker README; updated local-subnet testing docs and scripts to use docker paths.
  • Chores

    • CI workflow for linting and matrix testing added; Ruff introduced and broad typing/style modernizations applied.
  • Tests

    • Added test for monitoring configuration behavior.

- Add GitHub Actions workflows for CI and Docker publishing
  - CI workflow runs tests with pytest and lints with ruff
  - Docker publish workflow builds on releases and manual triggers
  - Multi-architecture support (amd64/arm64) for Docker images

- Organize Docker files in dedicated docker/ directory
  - Move all Docker-related files to docker/ folder
  - Add docker-compose.validator.yml for production deployment
  - Include Watchtower for automatic container updates

- Configure ruff linter in pyproject.toml
  - Add ruff configuration with sensible defaults
  - Include in dev dependencies

- Add comprehensive validator deployment documentation
  - Document Watchtower auto-update setup
  - Provide environment configuration examples
  - Include troubleshooting and best practices

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Remove GitHub, HF, and Axon configuration options
- Update Watchtower interval to 30 seconds
- Simplify environment variable documentation
- Reflect actual docker-compose.validator.yml state
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 12, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Adds CI and Docker publish workflows, Docker validator compose and deployment docs; updates docker paths in scripts; adds Ruff config; modernizes type hints across many modules; expands package exports; refactors monitoring manager and WandB backend; introduces an emission-burn config and test for MonitoringConfig.

Changes

Cohort / File(s) Summary
CI Workflow
.github/workflows/ci.yml
New CI workflow: lint (Ruff) and test (pytest) jobs; Python matrix 3.9–3.11; uv-based install; triggers on push and pull_request.
Docker Publish Workflow
.github/workflows/docker-publish.yml
New Docker publish workflow for release/dispatch: optional test job, buildx/QEMU, GHCR push, metadata tagging, provenance attestations, amd64 build, caching.
Docker Assets & Docs
docker/README.md, docker/docker-compose.validator.yml, docs/VALIDATOR_DEPLOYMENT.md
Adds Docker README, validator docker-compose (GPU, volumes, env, healthcheck, Watchtower) and a validator deployment guide.
Docs: local subnet
docs/local-subnet-testing.md
Updated compose file references to docker/docker-compose.local-subnet.yml.
Tooling (Ruff)
pyproject.toml
Adds ruff dev dependency and [tool.ruff] lint/format configuration.
Scripts: docker path updates & utilities
scripts/setup-local-subnet.sh, scripts/run_integration_experiment.py, scripts/test-s3-credentials.py
Switches docker-compose invocations to docker/...; modernizes typing/annotations and logging/messages; minor formatting tweaks.
Core package exports
grail/__init__.py, grail/environments/__init__.py
Reorganizes and expands exports: exposes SAT types, drand helpers, comms HuggingFace functions, Prover/Verifier and PROTOCOL_VERSION.
CLI
grail/cli/__init__.py, grail/cli/mine.py, grail/cli/train.py, grail/cli/validate.py
Load .env defensively, centralize monitoring init, modernize type hints to built-in generics, add main() entry for validate, and small log-format changes.
Monitoring subsystem
grail/monitoring/*, grail/monitoring/backends/*
Modernize typing (PEP 585/604), export MonitoringConfig, improve manager lifecycle/flush/task resilience, backend init resilience, WandB text artifact handling and batching, update timer/gauge semantics.
Environments & Mining
grail/environments/*, grail/mining/rollout_generator.py
Migrate typing to built-ins; add SAT constants and SATProblem.solution; rollout dataclass adds completion_length; signature updates.
Infrastructure & comms
grail/infrastructure/*, grail/drand.py, chain.py
Typing modernization, import reorders, small exception-chaining fixes (raise ... from e) and minor style tweaks.
Core logic
grail/grail.py
Typing modernized, add function-level cache for r_vec_from_randomness, improved exception chaining, small log formatting tweaks.
Shared & tests
grail/shared/*, tests/*, experiments/run_all.py
Import-order/whitespace tweaks, add test test_monitoring_config_disabled_env, small formatting changes.
Env example
.env.example, grail/shared/constants.py
Adds emission-burn config vars: GRAIL_BURN_UID, GRAIL_BURN_PERCENTAGE and corresponding constants.
Misc whitespace
docker/Dockerfile, experiments/run_all.py
Small whitespace-only edits (no behavioral change).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as grail.cli (Typer)
  participant Env as .env loader
  participant MonCfg as MonitoringConfig
  participant MonMgr as MonitoringManager
  participant Backend as MonitoringBackend (WandB/Null)

  User->>CLI: invoke command
  CLI->>Env: load_dotenv(override=True)
  Env-->>CLI: env vars (best-effort)
  CLI->>MonCfg: from_environment()
  MonCfg-->>CLI: config dict
  CLI->>MonMgr: initialize(config)
  alt monitoring enabled
    MonMgr->>Backend: initialize(config)
    Backend-->>MonMgr: ok / fallback
  else disabled/failed
    MonMgr->>MonMgr: install NullBackend
  end
  CLI-->>User: run subcommand
Loading
sequenceDiagram
  autonumber
  participant GH as GitHub (release/dispatch)
  participant WF as docker-publish.yml
  participant Test as job:test
  participant Build as job:build-and-push
  participant Meta as docker/metadata-action
  participant Reg as ghcr.io

  GH->>WF: trigger (release/workflow_dispatch or release)
  alt skip_tests != 'true'
    WF->>Test: run pytest via uv
    Test-->>WF: success/fail
  else skip
    WF-->>WF: tests skipped
  end
  WF->>Build: proceed
  Build->>Meta: extract tags (release or dispatch)
  Build->>Reg: docker login
  Build->>Reg: buildx build+push (amd64) with cache & provenance
  Reg-->>Build: image digest & tags
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

enhancement, documentation

Suggested reviewers

  • distributedstatemachine

Poem

"A rabbit hops the CI vine, nudges docker into flight,
Types hop home to modern burrows under soft pale light.
Monitoring hums, rollouts queue, a burn UID in place—
Carrots built and images pushed, the garden finds its pace. 🥕"


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d20da9 and 593314e.

📒 Files selected for processing (6)
  • .env.example (1 hunks)
  • docker/Dockerfile (1 hunks)
  • docker/docker-compose.validator.yml (1 hunks)
  • docs/VALIDATOR_DEPLOYMENT.md (1 hunks)
  • grail/cli/validate.py (25 hunks)
  • grail/shared/constants.py (1 hunks)
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/ci

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Changed --dev-dependencies to --dev to match current uv CLI
- Add proper exception chaining with 'from e' for B904 errors
- Remove unused Optional import from grail/cli/__init__.py
- All 269 ruff errors now fixed
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/local-subnet-testing.md (1)

106-113: Make compose path consistent throughout the doc.

Normalize all occurrences to docker/docker-compose.local-subnet.yml.

- docker compose -f docker-compose.local-subnet.yml logs -f miner-1
+ docker compose -f docker/docker-compose.local-subnet.yml logs -f miner-1
- docker compose -f docker-compose.local-subnet.yml logs -f validator
+ docker compose -f docker/docker-compose.local-subnet.yml logs -f validator
- docker compose -f docker-compose.local-subnet.yml logs -f alice
+ docker compose -f docker/docker-compose.local-subnet.yml logs -f alice

- docker compose -f docker-compose.local-subnet.yml down
+ docker compose -f docker/docker-compose.local-subnet.yml down
- docker compose -f docker-compose.local-subnet.yml down -v
+ docker compose -f docker/docker-compose.local-subnet.yml down -v

Also applies to: 177-183

🧹 Nitpick comments (17)
scripts/setup-local-subnet.sh (2)

73-75: Minor typo in user-facing text.

-# No need to fund wallets - Alice already has 2funds
+# No need to fund wallets — Alice already has funds

4-4: Harden script error handling for pipelines.

Piped commands bypass set -e; add pipefail.

-set -e
+set -euo pipefail
pyproject.toml (2)

189-191: Ruff added only to dependency-groups; dev extra lacks it and flake8 versions diverge.

Keep a single source of truth for “dev” deps to avoid drift.

[project.optional-dependencies]
 dev = [
-    "pytest>=7.0.0",
+    "pytest>=7.0.0",
     "pytest-asyncio>=0.21.0",
     "black>=22.0.0",
-    "flake8>=5.0.0",
+    "flake8==7.3.0",
+    "ruff>=0.8.0",
     "mypy>=1.0.0",
     "pre-commit>=2.20.0",
     "bittensor-cli>=9.10.2",
     "zstandard>=0.21.0",
 ]

Optionally drop the duplicate [dependency-groups].dev or keep both in sync.


11-14: Author name typo.

-    {name = "distriburedstatemachine"},
+    {name = "distributedstatemachine"},
.github/workflows/ci.yml (1)

1-58: Trim trailing spaces and add EOF newline to satisfy yamllint.

Also consider a concurrency group to auto-cancel superseded runs on PRs:

 name: CI
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
docker/.env.validator.example (1)

9-10: Optional ordering and newline per dotenv-linter.

-WALLET_NAME=your_wallet_name
-WALLET_HOTKEY=your_hotkey_name
+WALLET_HOTKEY=your_hotkey_name
+WALLET_NAME=your_wallet_name

And add a trailing newline at file end.

docs/VALIDATOR_DEPLOYMENT.md (2)

49-50: Prefer docker compose over docker-compose for consistency with other docs.

-docker-compose -f docker/docker-compose.validator.yml --env-file docker/.env.validator up -d
+docker compose -f docker/docker-compose.validator.yml --env-file docker/.env.validator up -d
...
-docker-compose -f docker/docker-compose.validator.yml restart validator
+docker compose -f docker/docker-compose.validator.yml restart validator

Also applies to: 143-144


212-213: Wrap bare URLs in Markdown links (mdlint).

-- GitHub Issues: https://github.com/tplr-ai/grail/issues
-- Documentation: https://github.com/tplr-ai/grail/blob/main/SPEC.md
+- GitHub Issues: [https://github.com/tplr-ai/grail/issues](https://github.com/tplr-ai/grail/issues)
+- Documentation: [https://github.com/tplr-ai/grail/blob/main/SPEC.md](https://github.com/tplr-ai/grail/blob/main/SPEC.md)
.github/workflows/docker-publish.yml (1)

1-127: Solid multi-arch build/push with provenance; clean tag strategy.

Minor: remove trailing spaces, add EOF newline to satisfy yamllint. Optionally add concurrency to avoid duplicate manual runs.

 name: Docker Publish
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
docker/docker-compose.validator.yml (2)

55-60: Healthcheck is simplistic but acceptable. Consider a first-party health endpoint if available.

If grail exposes a health command, prefer:

healthcheck:
  test: ["CMD", "uv", "run", "grail", "health"]

1-86: Trim trailing spaces and add EOF newline.

docker/README.md (6)

25-27: Use modern Compose CLI (“docker compose”) and note the minimum version.

The legacy docker-compose is deprecated. Prefer docker compose and document the version requirement.

-# Start validator with auto-updates
-docker-compose -f docker/docker-compose.validator.yml --env-file docker/.env.validator up -d
+# Start validator with auto-updates (Docker Engine 20.10+ and Compose v2)
+docker compose -f docker/docker-compose.validator.yml --env-file docker/.env.validator up -d

12-13: Auto-updates with Watchtower: pin tags and scope updates.

Unpinned tags can pull breaking changes. Recommend pinned image tags and selective Watchtower enablement via labels.

You can add to README:

- - **`docker-compose.validator.yml`** - Production deployment with Watchtower for automatic updates
+ - **`docker-compose.validator.yml`** - Production deployment; enable Watchtower for selective, pinned updates.
+   - Pin image tags (e.g., `grail:<major>.<minor>`), and use `com.centurylinklabs.watchtower.enable="true"` labels to opt-in containers.

16-20: Add prerequisites before Quick Start.

Clarifies required tooling and optional GPU setup.

 ## Quick Start
+
+### Prerequisites
+- Docker Engine 20.10+ and Docker Compose v2.x
+- Optional: NVIDIA Container Toolkit if running with GPUs
+
 ### Deploy a Validator

32-37: Add teardown/cleanup commands for local runs.

Helps avoid orphaned volumes/networks.

 docker-compose -f docker/docker-compose.local-subnet.yml up -d
+
+# Teardown and remove volumes when done
+# (use 'docker compose' if on Compose v2)
+docker-compose -f docker/docker-compose.local-subnet.yml down -v

47-48: Clarify build context and enable BuildKit for speed.

Minor doc tweak for reliability/perf when building from repo root.

-docker build -f docker/Dockerfile -t grail:local .
+DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile -t grail:local .
+# Run from the repo root so build context (.) includes the source tree.

7-9: Clarify image scope (miners vs validators).

If miners and validators diverge, consider separate targets/images or multi-stage targets documented here to avoid bloated images.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f73109 and 9c356a7.

📒 Files selected for processing (9)
  • .github/workflows/ci.yml (1 hunks)
  • .github/workflows/docker-publish.yml (1 hunks)
  • docker/.env.validator.example (1 hunks)
  • docker/README.md (1 hunks)
  • docker/docker-compose.validator.yml (1 hunks)
  • docs/VALIDATOR_DEPLOYMENT.md (1 hunks)
  • docs/local-subnet-testing.md (2 hunks)
  • pyproject.toml (1 hunks)
  • scripts/setup-local-subnet.sh (3 hunks)
🧰 Additional context used
🪛 dotenv-linter (3.3.0)
docker/.env.validator.example

[warning] 10-10: [UnorderedKey] The WALLET_HOTKEY key should go before the WALLET_NAME key

(UnorderedKey)


[warning] 22-22: [EndingBlankLine] No blank line at the end of the file

(EndingBlankLine)

🪛 YAMLlint (1.37.1)
.github/workflows/ci.yml

[error] 15-15: trailing spaces

(trailing-spaces)


[error] 21-21: trailing spaces

(trailing-spaces)


[error] 24-24: trailing spaces

(trailing-spaces)


[error] 27-27: trailing spaces

(trailing-spaces)


[error] 30-30: trailing spaces

(trailing-spaces)


[error] 42-42: trailing spaces

(trailing-spaces)


[error] 48-48: trailing spaces

(trailing-spaces)


[error] 51-51: trailing spaces

(trailing-spaces)


[error] 54-54: trailing spaces

(trailing-spaces)


[error] 58-58: no new line character at the end of file

(new-line-at-end-of-file)

docker/docker-compose.validator.yml

[error] 16-16: trailing spaces

(trailing-spaces)


[error] 21-21: trailing spaces

(trailing-spaces)


[error] 25-25: trailing spaces

(trailing-spaces)


[error] 29-29: trailing spaces

(trailing-spaces)


[error] 35-35: trailing spaces

(trailing-spaces)


[error] 41-41: trailing spaces

(trailing-spaces)


[error] 49-49: trailing spaces

(trailing-spaces)


[error] 52-52: trailing spaces

(trailing-spaces)


[error] 60-60: trailing spaces

(trailing-spaces)


[error] 86-86: no new line character at the end of file

(new-line-at-end-of-file)

.github/workflows/docker-publish.yml

[error] 33-33: trailing spaces

(trailing-spaces)


[error] 39-39: trailing spaces

(trailing-spaces)


[error] 42-42: trailing spaces

(trailing-spaces)


[error] 45-45: trailing spaces

(trailing-spaces)


[error] 59-59: trailing spaces

(trailing-spaces)


[error] 63-63: trailing spaces

(trailing-spaces)


[error] 70-70: trailing spaces

(trailing-spaces)


[error] 83-83: trailing spaces

(trailing-spaces)


[error] 94-94: trailing spaces

(trailing-spaces)


[error] 97-97: trailing spaces

(trailing-spaces)


[error] 114-114: trailing spaces

(trailing-spaces)


[error] 121-121: trailing spaces

(trailing-spaces)


[error] 127-127: no new line character at the end of file

(new-line-at-end-of-file)

🪛 markdownlint-cli2 (0.17.2)
docs/VALIDATOR_DEPLOYMENT.md

212-212: Bare URL used

(MD034, no-bare-urls)


213-213: Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (10)
scripts/setup-local-subnet.sh (2)

161-161: Compose path update looks good.


95-101: Registration success check is brittle (emoji/string match).

btcli output may change; prefer exit code or a clearer grep token if available.

Do we have a stable success string or a --json flag to parse? If yes, I can patch this block accordingly.

pyproject.toml (1)

192-217: Ruff config LGTM and matches CI usage.

.github/workflows/ci.yml (2)

9-33: Lint + format checks are correct and align with pyproject.


34-58: Matrix tests look good; set PYTHONPATH is helpful.

docker/.env.validator.example (1)

15-22: Sane defaults; clear REQUIRED/OPTIONAL split.

.github/workflows/docker-publish.yml (1)

71-114: Tag selection with fallback is fine; verify BUILD_DATE source on workflow_dispatch.

On workflow_dispatch, github.event.release.* is undefined. Confirm BUILD_DATE fallback works, or set it explicitly via:

-            BUILD_DATE=${{ github.event.release.created_at || github.event.repository.updated_at }}
+            BUILD_DATE=${{ github.event.release.created_at || github.event.repository.updated_at || github.run_id }}
docker/docker-compose.validator.yml (2)

42-49: GPU reservation may be ignored by non‑Swarm docker compose.

Compose often ignores deploy.resources.* unless in Swarm. Ensure GPUs are actually visible inside the container.

If not, switch to the compose-supported GPUs flag:

 services:
   validator:
+    # Request one GPU in non‑Swarm compose
+    deploy: {}
+    gpus: 1
-    deploy:
-      resources:
-        reservations:
-          devices:
-            - driver: nvidia
-              count: 1
-              capabilities: [gpu]

Or keep deploy.resources and add gpus: 1 for broad compatibility.


36-41: Mount points and persistence look right; wallet is read-only.

docker/README.md (1)

52-54: Verified — linked docs and paths are correct.
File: docker/README.md (lines 52–54): docs/VALIDATOR_DEPLOYMENT.md, docs/local-subnet-testing.md and the referenced docker files exist at the stated paths; no broken links.


```bash
docker compose -f docker-compose.local-subnet.yml up -d alice bob s3 s3-setup
docker compose -f docker/docker-compose.local-subnet.yml up -d alice bob s3 s3-setup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Path fix applied in two places; remaining references still point to old location.

Several compose commands below still use docker-compose.local-subnet.yml without the docker/ prefix; this will confuse users.

Apply:

- docker compose -f docker-compose.local-subnet.yml logs -f
+ docker compose -f docker/docker-compose.local-subnet.yml logs -f

And similarly update Lines 109–113, 177, 182 to use docker/docker-compose.local-subnet.yml.

Also applies to: 98-98

🤖 Prompt for AI Agents
In docs/local-subnet-testing.md around lines 46, 98, 109-113, 177 and 182 update
any docker compose commands that reference docker-compose.local-subnet.yml to
include the docker/ prefix (i.e. change occurrences of
"docker-compose.local-subnet.yml" to "docker/docker-compose.local-subnet.yml")
so all references point to the new path; scan the file for any remaining uses of
the old filename and replace them consistently.

Comment on lines +179 to 187
echo " All services: docker compose -f docker/docker-compose.local-subnet.yml logs -f"
echo " Miner 1 only: docker compose -f docker/docker-compose.local-subnet.yml logs -f miner-1"
echo " Validator: docker compose -f docker/docker-compose.local-subnet.yml logs -f validator"
echo ""
echo "Rebuild and restart services (after code changes):"
echo " All GRAIL: docker compose -f docker-compose.local-subnet.yml up -d --build miner-1 miner-2 validator"
echo " Miner 1 only: docker compose -f docker-compose.local-subnet.yml up -d --build miner-1"
echo " Validator: docker compose -f docker-compose.local-subnet.yml up -d --build validator"
echo " All GRAIL: docker compose -f docker/docker-compose.local-subnet.yml up -d --build miner-1 miner-2 validator"
echo " Miner 1 only: docker compose -f docker/docker-compose.local-subnet.yml up -d --build miner-1"
echo " Validator: docker compose -f docker/docker-compose.local-subnet.yml up -d --build validator"
echo ""docker compose -nf docker-compose.local-subnet.yml logs -f miner-1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

One malformed echo and stray flags; also one old compose path.

Line 187 has a broken quote and -nf; should be -f with the new docker/ path.

-echo ""docker compose -nf docker-compose.local-subnet.yml logs -f miner-1
+echo "  Miner 1 only:  docker compose -f docker/docker-compose.local-subnet.yml logs -f miner-1"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
echo " All services: docker compose -f docker/docker-compose.local-subnet.yml logs -f"
echo " Miner 1 only: docker compose -f docker/docker-compose.local-subnet.yml logs -f miner-1"
echo " Validator: docker compose -f docker/docker-compose.local-subnet.yml logs -f validator"
echo ""
echo "Rebuild and restart services (after code changes):"
echo " All GRAIL: docker compose -f docker-compose.local-subnet.yml up -d --build miner-1 miner-2 validator"
echo " Miner 1 only: docker compose -f docker-compose.local-subnet.yml up -d --build miner-1"
echo " Validator: docker compose -f docker-compose.local-subnet.yml up -d --build validator"
echo " All GRAIL: docker compose -f docker/docker-compose.local-subnet.yml up -d --build miner-1 miner-2 validator"
echo " Miner 1 only: docker compose -f docker/docker-compose.local-subnet.yml up -d --build miner-1"
echo " Validator: docker compose -f docker/docker-compose.local-subnet.yml up -d --build validator"
echo ""docker compose -nf docker-compose.local-subnet.yml logs -f miner-1
echo " All services: docker compose -f docker/docker-compose.local-subnet.yml logs -f"
echo " Miner 1 only: docker compose -f docker/docker-compose.local-subnet.yml logs -f miner-1"
echo " Validator: docker compose -f docker/docker-compose.local-subnet.yml logs -f validator"
echo ""
echo "Rebuild and restart services (after code changes):"
echo " All GRAIL: docker compose -f docker/docker-compose.local-subnet.yml up -d --build miner-1 miner-2 validator"
echo " Miner 1 only: docker compose -f docker/docker-compose.local-subnet.yml up -d --build miner-1"
echo " Validator: docker compose -f docker/docker-compose.local-subnet.yml up -d --build validator"
echo " Miner 1 only: docker compose -f docker/docker-compose.local-subnet.yml logs -f miner-1"
🤖 Prompt for AI Agents
In scripts/setup-local-subnet.sh around lines 179 to 187, there is a malformed
echo on line 187 containing a stray quote and the wrong flag "-nf" and an
outdated compose path; replace that broken echo with a corrected echo that uses
the -f flag and the current docker/docker-compose.local-subnet.yml path and
remove the stray quote so the line consistently prints the "Miner 1 only" logs
command using the new path and proper -f flag.

Changed 'uv sync --dev' to 'uv sync --extra dev' to properly install
pytest and other dev dependencies from project.optional-dependencies
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…r docker config

- Added all required R2/S3 storage credentials to docker-compose.validator.yml
- Fixed environment variable naming (BT_WALLET_COLD/HOT, BT_NETWORK)
- Added model configuration variables
- Enhanced monitoring configuration with all WandB options
- Updated .env.validator.example with required R2 credentials
- Updated documentation to reflect all required variables

Validators now have proper configuration parity with integration setup
- Remove linux/arm64 platform from docker build
- Add QEMU setup step (though only building amd64 now)
- Fixes buildx QEMU emulator error in CI
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
tests/test_basics.py (1)

19-26: Test has a logic inconsistency

The test expects is_monitoring_enabled() to return False on line 21, but then expects backend_type to be "null" on line 24. However, when monitoring is disabled, the mode could still be any of the valid options. The test should either:

  1. Check that monitoring is disabled OR backend_type is "null"
  2. Verify the consistency between the two conditions

Apply this fix to make the test assertions consistent:

 def test_monitoring_config_disabled_env() -> None:
     # With defaults from conftest, monitoring should be effectively disabled
-    assert MonitoringConfig.is_monitoring_enabled() is False
     cfg = MonitoringConfig.from_environment()
-    # Ensure the parsed configuration reflects disabled/NULL backend
-    assert cfg["backend_type"] == "null"
-    assert cfg["mode"] in {"disabled", "online", "offline"}
+    # Either monitoring is disabled OR using null backend
+    is_disabled = not MonitoringConfig.is_monitoring_enabled()
+    is_null = cfg["backend_type"] == "null"
+    assert is_disabled or is_null, "Monitoring should be disabled or using null backend"
+    assert cfg["mode"] in {"disabled", "online", "offline"}
grail/grail.py (1)

123-126: Potential issue with dynamic cache attribute initialization

The function is adding a cache attribute to itself dynamically, but the docstring appears after this initialization. This is unconventional and could cause issues with documentation tools or linters. The cache initialization should be moved after the docstring or implemented differently.

Consider moving the cache initialization after the docstring or using a module-level cache:

 def r_vec_from_randomness(rand_hex: str, d_model: int) -> torch.Tensor:
-    # Add cache attribute to function
-    if not hasattr(r_vec_from_randomness, "_cache"):
-        # Initialize a simple dict cache; attribute added dynamically
-        r_vec_from_randomness._cache = {}  # type: ignore[attr-defined]
     """
     Generate random projection vector from drand randomness.

     Takes drand randomness (32 bytes hex) and expands it deterministically
     into a d_model-dimensional vector using a PRF. This ensures everyone
     with the same drand value generates the same projection vector.

     Args:
         rand_hex: Hex string of drand randomness (typically from drand beacon)
         d_model: Model hidden dimension size

     Returns:
         Random projection vector of shape (d_model,) with int32 values

     Raises:
         ValueError: If rand_hex is invalid or d_model is invalid

     Note:
         Uses big-endian byte order for cross-platform consistency
     """
+    # Add cache attribute to function
+    if not hasattr(r_vec_from_randomness, "_cache"):
+        # Initialize a simple dict cache; attribute added dynamically
+        r_vec_from_randomness._cache = {}  # type: ignore[attr-defined]
grail/monitoring/manager.py (1)

74-84: Potential race condition in async component initialization

The _ensure_async_components method checks and creates the shutdown event and flush task, but there's a potential race condition if this method is called concurrently from multiple coroutines. Consider using a lock to ensure thread-safe initialization.

Apply this diff to add thread-safe initialization:

 def __init__(self, backend: MonitoringBackend | None = None):
     """Initialize the monitoring manager.
     
     Args:
         backend: The monitoring backend to use. If None, uses NullBackend.
     """
     self.backend = backend or NullBackend()
     self._metric_buffer: list[MetricData] = []
     self._buffer_size = 100
     self._flush_interval = 30.0  # seconds
     self._flush_task: asyncio.Task | None = None
     self._shutdown_event: asyncio.Event | None = None
     self._initialized = False
     self._config: dict[str, Any] = {}
     self._current_block: int | None = None
     self._current_window: int | None = None
     self._start_time: float | None = None
+    self._init_lock = asyncio.Lock()

-def _ensure_async_components(self) -> None:
-    """Ensure async components are initialized (lazy initialization)."""
-    if self._shutdown_event is None:
-        self._shutdown_event = asyncio.Event()
-    if self._flush_task is None or self._flush_task.done():
-        self._flush_task = asyncio.create_task(self._periodic_flush())
+async def _ensure_async_components(self) -> None:
+    """Ensure async components are initialized (lazy initialization)."""
+    async with self._init_lock:
+        if self._shutdown_event is None:
+            self._shutdown_event = asyncio.Event()
+        if self._flush_task is None or self._flush_task.done():
+            self._flush_task = asyncio.create_task(self._periodic_flush())

Then update the call site at line 130:

-    self._ensure_async_components()
+    await self._ensure_async_components()
chain.py (4)

136-141: Commit format must be 128 chars; enforce pad/truncate before on-chain write.

Current concat may produce wrong length and fail validation.

-        concatenated = bucket.account_id + bucket.access_key_id + bucket.secret_access_key
+        # 32 account_id + 32 access_key_id + 64 secret_access_key
+        acc = (bucket.account_id or "")[:32].ljust(32)
+        key = (bucket.access_key_id or "")[:32].ljust(32)
+        sec = (bucket.secret_access_key or "")[:64].ljust(64)
+        concatenated = acc + key + sec
+        if len(concatenated) != 128:
+            raise ValueError("Commitment must be exactly 128 characters.")

162-175: Redact secrets in logs.

Both comparison and commit logs expose access keys; mask or hash instead.

-            logger.debug(
-                "Comparing current commitment to bucket details from the environment:\n"
-                f"Commitment: {commitment_str}\n"
-                f"Current: {bucket_details_from_env}"
-            )
+            def _mask(s: str) -> str:
+                return (s[:4] + "…" + s[-4:]) if s else ""
+            logger.debug(
+                "Comparing current commitment to bucket details from the environment:\n"
+                f"Commitment: {_mask(commitment_str)} (len={len(commitment_str)})\n"
+                f"Current: {_mask(bucket_details_from_env)} (len={len(bucket_details_from_env)})"
+            )

226-235: Do not log full on-chain commitment; keep exception chaining.

Good use of “raise … from e”, but the info log leaks secrets. Downgrade and redact.

-            concatenated = self.subtensor.get_commitment(self.netuid, uid)
-            logger.info(f"Commitment fetched: {concatenated}")
+            concatenated = self.subtensor.get_commitment(self.netuid, uid)
+            logger.debug("Commitment fetched (len=%d)", len(concatenated))

304-309: Return empty dict on error to match annotation and callers.

Currently returns None; keep type stable.

-        except Exception:
+        except Exception:
             self.subtensor.substrate.close()
             self.subtensor.substrate.initialize()
-            return
+            return {}
grail/cli/train.py (1)

293-296: grad_norm may be undefined; fix and correct avg_loss denominator.

If training is skipped early (health check), grad_norm is never set; avg_loss divisor is also off by +1.

@@
-            for epoch in range(2):  # Two epochs for better learning
+            for epoch in range(2):  # Two epochs for better learning
                 total_loss = 0
                 batch_size = min(4, len(texts))  # Small batch size
+                last_grad_norm = 0.0
@@
-                    grad_norm = torch.nn.utils.clip_grad_norm_(
+                    grad_norm = torch.nn.utils.clip_grad_norm_(
                         self.model.parameters(), max_norm=0.5
                     )  # Reduced from 1.0
+                    last_grad_norm = float(grad_norm)
@@
-                avg_loss = total_loss / (len(texts) // batch_size + 1)
+                num_batches = (len(sorted_indices) + batch_size - 1) // batch_size
+                avg_loss = total_loss / max(1, num_batches)
                 logger.info(f"Epoch {epoch + 1} completed - avg loss: {avg_loss:.4f}")
@@
                 if monitor:
                     await monitor.log_gauge("training.epoch_loss", avg_loss)
                     await monitor.log_counter("training.epochs_completed")
-                    await monitor.log_gauge("training.gradient_norm", float(grad_norm))
+                    await monitor.log_gauge("training.gradient_norm", float(last_grad_norm))

Also applies to: 349-357, 383-392

🧹 Nitpick comments (18)
.github/workflows/docker-publish.yml (3)

33-33: Clean up trailing whitespace in YAML file.

YAMLlint detected trailing spaces on multiple lines. While not critical, removing them improves consistency.

Apply this diff to remove trailing spaces:

-      - uses: actions/checkout@v4
-      
+      - uses: actions/checkout@v4
+
-      
-      - name: Set up Python
+
+      - name: Set up Python

And similar changes for lines 39, 42, 45, 59, 63, 70, 83, 94, 97, 100, 117, 124.

Also applies to: 39-39, 42-42, 45-45, 59-59, 63-63, 70-70, 83-83, 94-94, 97-97, 100-100, 117-117, 124-124, 130-130


130-130: Add newline at end of file.

YAMLlint recommends adding a newline character at the end of the file.

-          echo "Digest: ${{ steps.push.outputs.digest }}"
+          echo "Digest: ${{ steps.push.outputs.digest }}"
+

107-107: Consider adding support for ARM architecture.

The workflow currently builds only for linux/amd64. Consider adding linux/arm64 to support ARM-based deployments, which are becoming increasingly common.

-          platforms: linux/amd64
+          platforms: linux/amd64,linux/arm64
docker/docker-compose.validator.yml (3)

16-16: Remove trailing spaces from YAML file.

YAMLlint identified trailing spaces that should be removed for consistency.

Apply similar cleanup as suggested for the GitHub workflow file - remove trailing spaces from the indicated lines.

Also applies to: 21-21, 31-31, 41-41, 46-46, 52-52, 58-58, 64-64, 72-72, 75-75, 83-83


109-109: Add newline at end of file.

-    driver: local
+    driver: local
+

94-94: Consider increasing Watchtower interval for production.

The current 30-second interval for checking updates is quite aggressive and may cause unnecessary load. Consider increasing this to 300 seconds (5 minutes) or more for production deployments.

-    command: --interval 30 --cleanup --label-enable
+    command: --interval 300 --cleanup --label-enable
docs/VALIDATOR_DEPLOYMENT.md (2)

230-231: Format bare URLs as proper markdown links.

Markdownlint detected bare URLs that should be formatted as proper links.

-- GitHub Issues: https://github.com/tplr-ai/grail/issues
-- Documentation: https://github.com/tplr-ai/grail/blob/main/SPEC.md
+- GitHub Issues: [https://github.com/tplr-ai/grail/issues](https://github.com/tplr-ai/grail/issues)
+- Documentation: [https://github.com/tplr-ai/grail/blob/main/SPEC.md](https://github.com/tplr-ai/grail/blob/main/SPEC.md)

94-96: Document the aggressive update interval implications.

The 30-second update check interval is very aggressive. Consider documenting the potential implications (increased network traffic, API rate limits) and suggesting production-appropriate values.

 ### Update Schedule
 
 - **Current Setting**: Checks every 30 seconds for rapid updates
 - This aggressive polling ensures validators stay up-to-date with the latest releases
+- **Note**: For production deployments, consider increasing the interval to reduce load:
+  - Development/Testing: 30 seconds
+  - Staging: 5 minutes (300 seconds)
+  - Production: 30 minutes (1800 seconds)
 - The image is public on GitHub Container Registry, no authentication needed
grail/cli/__init__.py (1)

113-115: Redundant monitoring check

In silent mode (verbosity == 0), the code checks MonitoringConfig.is_monitoring_enabled() again on line 114, even though it was already checked on line 100. This seems redundant.

Consider simplifying the logic:

         elif verbosity == 0:
             # Silent mode - but still allow monitoring if explicitly enabled
-            if not MonitoringConfig.is_monitoring_enabled():
-                config["backend_type"] = "null"
+            # Force null backend in silent mode unless explicitly configured
+            config["backend_type"] = "null"
grail/infrastructure/comms.py (1)

76-77: Unused variable assignments can be removed

Lines 77 and 83 contain standalone expressions that appear to be extracting the bucket name but not using the result. These were likely left over from refactoring.

Remove these unused expressions:

-            credentials.name.strip()
         elif isinstance(credentials, dict):
             # Handle dict format (from chain commitments or legacy)
             account_id = credentials.get("account_id", "").strip()
             access_key = credentials.get("access_key_id", "").strip()
             secret_key = credentials.get("secret_access_key", "").strip()
-            credentials.get("name", credentials.get("bucket_name", "")).strip()

Also applies to: 82-83

grail/mining/rollout_generator.py (1)

266-266: Import placement could be improved

The import statement for GRAIL functions is placed inside the method at runtime. While this works, it's generally better practice to have imports at the module level unless there's a specific reason for lazy loading (e.g., circular dependencies).

Consider moving this import to the top of the file with other imports:

 import bittensor as bt
 import torch

+from ..grail import dot_mod_q, r_vec_from_randomness, sign_s_vals
 from ..shared.constants import MAX_NEW_TOKENS
 from ..shared.hf_compat import resolve_hidden_size

 ...

     # Generate GRAIL proof components (using existing grail.py logic)
-    from ..grail import dot_mod_q, r_vec_from_randomness, sign_s_vals
grail/monitoring/manager.py (1)

202-202: Minor issue: Inconsistent return type annotation

The timer method's return type is annotated as Any, but it should be a context manager type. Consider using Generator[None, None, None] from collections.abc for consistency with the base class.

+from collections.abc import Generator

 @contextmanager
-def timer(self, name: str, tags: dict[str, str] | None = None) -> Any:
+def timer(self, name: str, tags: dict[str, str] | None = None) -> Generator[None, None, None]:
grail/shared/hf_compat.py (2)

31-37: Tighten attribute access (simplify and avoid hasattr+try).

Use getattr with a default to drop hasattr plus surrounding try.

-        text_cfg = getattr(cfg, "text_config", None)
-        if text_cfg is not None and hasattr(text_cfg, "hidden_size"):
-            val = text_cfg.hidden_size
+        text_cfg = getattr(cfg, "text_config", None)
+        if text_cfg is not None:
+            val = getattr(text_cfg, "hidden_size", None)
             if isinstance(val, int) and val > 0:
                 return val

74-96: Consider rope-scaled context and tokenizer fallback.

Optional: also inspect config.rope_scaling.max_position_embeddings (if dict-like) and, if available in the calling context, tokenizer.model_max_length before defaulting to 16384.

chain.py (2)

207-214: Docstring order vs implementation.

Docs say “bucket name, account ID, …” but code uses account_id as both name and id. Clarify to “account_id (also used as name), access_key_id, secret_access_key”.

-        **Note:** The order of fields (bucket name, account ID, access key ID,
+        **Note:** The order of fields (account ID (also used as name), access key ID,

377-385: Comment threshold mismatch.

Code filters stake <= 20000 but comment says 1000. Align the comment.

-        # for peers still active with stake <= 1000.
+        # for peers still active with stake <= 20000.
.github/workflows/ci.yml (1)

14-33: Trim trailing spaces and add newline at EOF.

YAMLlint errors; also add minimal hardening.

 name: CI
 
 on:
   push:
     branches: [ '**' ]
   pull_request:
     branches: [ '**' ]
 
 jobs:
   lint:
     name: Lint with Ruff
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
-      
+
       - name: Install uv
         uses: astral-sh/setup-uv@v4
         with:
           enable-cache: true
           cache-dependency-glob: "uv.lock"
-      
+
       - name: Set up Python
         run: uv python install 3.11
-      
+
       - name: Install dependencies
         run: uv sync --extra dev
-      
+
       - name: Run Ruff linter
         run: uv run ruff check .
-      
+
       - name: Run Ruff formatter check
         run: uv run ruff format --check .
 
   test:
     name: Run Tests
     runs-on: ubuntu-latest
     strategy:
       matrix:
         python-version: ["3.9", "3.10", "3.11"]
     steps:
       - uses: actions/checkout@v4
-      
+
       - name: Install uv
         uses: astral-sh/setup-uv@v4
         with:
           enable-cache: true
           cache-dependency-glob: "uv.lock"
-      
+
       - name: Set up Python ${{ matrix.python-version }}
         run: uv python install ${{ matrix.python-version }}
-      
+
       - name: Install dependencies
         run: uv sync --extra dev
-      
+
       - name: Run tests with pytest
         run: uv run pytest tests/ -v --tb=short
         env:
           PYTHONPATH: ${{ github.workspace }}
+```
+```diff
+--- a/.github/workflows/ci.yml
++++ b/.github/workflows/ci.yml
+@@
+ name: CI
+ +permissions:
+ +  contents: read
+ +concurrency:
+ +  group: ci-${{ github.ref }}
+ +  cancel-in-progress: true

If desired, I can push a follow-up PR to run yamllint in CI to catch these automatically.

Also applies to: 42-58

docker/.env.validator.example (1)

8-21: Reorder keys for readability and dotenv linters; add EOF newline.

No functional change; improves DX.

-# Bittensor Wallet Configuration
-WALLET_NAME=your_wallet_name
-WALLET_HOTKEY=your_hotkey_name
+# Bittensor Wallet Configuration
+WALLET_HOTKEY=your_hotkey_name
+WALLET_NAME=your_wallet_name
@@
-# For Cloudflare R2:
-R2_ENDPOINT_URL=https://your-account-id.r2.cloudflarestorage.com
-R2_BUCKET_ID=your-bucket-name
-R2_ACCOUNT_ID=your-account-id
-R2_WRITE_ACCESS_KEY_ID=your-write-access-key
-R2_WRITE_SECRET_ACCESS_KEY=your-write-secret-key
-R2_READ_ACCESS_KEY_ID=your-read-access-key
-R2_READ_SECRET_ACCESS_KEY=your-read-secret-key
+# For Cloudflare R2:
+R2_ACCOUNT_ID=your-account-id
+R2_BUCKET_ID=your-bucket-name
+R2_ENDPOINT_URL=https://your-account-id.r2.cloudflarestorage.com
+R2_READ_ACCESS_KEY_ID=your-read-access-key
+R2_READ_SECRET_ACCESS_KEY=your-read-secret-key
+R2_WRITE_ACCESS_KEY_ID=your-write-access-key
+R2_WRITE_SECRET_ACCESS_KEY=your-write-secret-key

Also add a trailing newline at the end of the file.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c356a7 and d0ad846.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (37)
  • .github/workflows/ci.yml (1 hunks)
  • .github/workflows/docker-publish.yml (1 hunks)
  • chain.py (3 hunks)
  • docker/.env.validator.example (1 hunks)
  • docker/README.md (1 hunks)
  • docker/docker-compose.validator.yml (1 hunks)
  • docs/VALIDATOR_DEPLOYMENT.md (1 hunks)
  • experiments/run_all.py (0 hunks)
  • grail/__init__.py (1 hunks)
  • grail/cli/__init__.py (3 hunks)
  • grail/cli/mine.py (12 hunks)
  • grail/cli/train.py (6 hunks)
  • grail/cli/validate.py (22 hunks)
  • grail/drand.py (11 hunks)
  • grail/environments/__init__.py (1 hunks)
  • grail/environments/base.py (4 hunks)
  • grail/environments/sat.py (13 hunks)
  • grail/grail.py (8 hunks)
  • grail/infrastructure/chain.py (4 hunks)
  • grail/infrastructure/comms.py (13 hunks)
  • grail/infrastructure/credentials.py (1 hunks)
  • grail/infrastructure/drand.py (3 hunks)
  • grail/infrastructure/network.py (1 hunks)
  • grail/mining/rollout_generator.py (11 hunks)
  • grail/monitoring/__init__.py (1 hunks)
  • grail/monitoring/backends/__init__.py (1 hunks)
  • grail/monitoring/backends/null_backend.py (5 hunks)
  • grail/monitoring/backends/wandb_backend.py (10 hunks)
  • grail/monitoring/base.py (6 hunks)
  • grail/monitoring/config.py (7 hunks)
  • grail/monitoring/manager.py (9 hunks)
  • grail/shared/hf_compat.py (3 hunks)
  • grail/shared/schemas.py (1 hunks)
  • scripts/run_integration_experiment.py (10 hunks)
  • scripts/test-s3-credentials.py (6 hunks)
  • tests/conftest.py (1 hunks)
  • tests/test_basics.py (1 hunks)
💤 Files with no reviewable changes (1)
  • experiments/run_all.py
✅ Files skipped from review due to trivial changes (3)
  • grail/shared/schemas.py
  • grail/infrastructure/credentials.py
  • tests/conftest.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • docker/README.md
🧰 Additional context used
🧬 Code graph analysis (17)
grail/monitoring/__init__.py (2)
grail/monitoring/base.py (3)
  • MetricData (29-43)
  • MetricType (19-25)
  • MonitoringBackend (46-165)
grail/monitoring/config.py (1)
  • MonitoringConfig (14-236)
grail/monitoring/backends/__init__.py (1)
grail/monitoring/backends/wandb_backend.py (1)
  • WandBBackend (24-473)
grail/environments/__init__.py (1)
grail/environments/sat.py (3)
  • SATProblem (24-60)
  • SATRolloutGenerator (337-523)
  • create_sat_reward_vector (281-334)
grail/infrastructure/chain.py (1)
grail/shared/schemas.py (2)
  • Bucket (63-88)
  • BucketCredentials (6-60)
grail/monitoring/base.py (3)
grail/monitoring/backends/null_backend.py (4)
  • initialize (26-32)
  • log_metrics (42-48)
  • timer (51-61)
  • start_run (73-83)
grail/monitoring/backends/wandb_backend.py (4)
  • initialize (43-81)
  • log_metrics (180-218)
  • timer (324-346)
  • start_run (427-445)
grail/monitoring/manager.py (3)
  • initialize (49-72)
  • timer (202-218)
  • start_run (268-285)
grail/infrastructure/drand.py (1)
grail/drand.py (1)
  • get_current_chain (249-263)
grail/monitoring/backends/wandb_backend.py (2)
grail/monitoring/base.py (7)
  • MetricData (29-43)
  • MetricType (19-25)
  • MonitoringBackend (46-165)
  • initialize (55-70)
  • log_metrics (86-96)
  • timer (100-114)
  • start_run (128-138)
grail/monitoring/manager.py (3)
  • initialize (49-72)
  • timer (202-218)
  • start_run (268-285)
grail/monitoring/manager.py (3)
grail/monitoring/backends/null_backend.py (4)
  • NullBackend (18-103)
  • initialize (26-32)
  • timer (51-61)
  • start_run (73-83)
grail/monitoring/backends/wandb_backend.py (4)
  • WandBBackend (24-473)
  • initialize (43-81)
  • timer (324-346)
  • start_run (427-445)
grail/monitoring/base.py (6)
  • MetricData (29-43)
  • MetricType (19-25)
  • MonitoringBackend (46-165)
  • initialize (55-70)
  • timer (100-114)
  • start_run (128-138)
grail/monitoring/backends/null_backend.py (3)
grail/monitoring/base.py (6)
  • MetricData (29-43)
  • MonitoringBackend (46-165)
  • initialize (55-70)
  • log_metrics (86-96)
  • timer (100-114)
  • start_run (128-138)
grail/monitoring/backends/wandb_backend.py (4)
  • initialize (43-81)
  • log_metrics (180-218)
  • timer (324-346)
  • start_run (427-445)
grail/monitoring/manager.py (3)
  • initialize (49-72)
  • timer (202-218)
  • start_run (268-285)
grail/mining/rollout_generator.py (2)
grail/grail.py (3)
  • dot_mod_q (293-303)
  • r_vec_from_randomness (122-214)
  • sign_s_vals (306-327)
grail/environments/sat.py (2)
  • create_prompt (381-395)
  • get_final_info (487-518)
tests/test_basics.py (1)
grail/monitoring/config.py (1)
  • MonitoringConfig (14-236)
grail/drand.py (1)
grail/infrastructure/drand.py (3)
  • get_current_chain (82-84)
  • get_drand_beacon (87-137)
  • get_mock_beacon (140-146)
grail/cli/__init__.py (3)
grail/cli/mine.py (1)
  • register (751-752)
grail/cli/train.py (1)
  • register (419-420)
grail/cli/validate.py (1)
  • register (232-233)
grail/environments/sat.py (2)
grail/mining/rollout_generator.py (3)
  • RolloutGenerator (102-379)
  • reset_environment (334-336)
  • parse_action (344-346)
grail/environments/base.py (3)
  • Parser (7-22)
  • RewardVector (25-129)
  • parse (11-22)
grail/grail.py (2)
grail/environments/sat.py (1)
  • generate_sat_problem (63-93)
grail/monitoring/manager.py (1)
  • get_monitoring_manager (338-347)
grail/cli/validate.py (3)
grail/cli/mine.py (2)
  • parse_filename (82-94)
  • parse_window_filename (97-109)
grail/cli/train.py (2)
  • parse_filename (65-77)
  • parse_window_filename (80-92)
grail/infrastructure/chain.py (1)
  • GrailChainManager (17-298)
grail/__init__.py (4)
grail/environments/sat.py (2)
  • create_sat_reward_vector (281-334)
  • generate_sat_problem (63-93)
grail/infrastructure/comms.py (10)
  • download_file_chunked (371-483)
  • download_from_huggingface (996-1051)
  • file_exists (525-547)
  • get_file (568-586)
  • get_valid_rollouts (754-785)
  • list_bucket_files (550-565)
  • login_huggingface (793-820)
  • sink_window_inferences (594-622)
  • upload_file_chunked (204-297)
  • upload_to_huggingface (823-993)
grail/infrastructure/drand.py (2)
  • get_drand_beacon (87-137)
  • get_round_at_time (170-173)
grail/mining/rollout_generator.py (1)
  • RolloutGenerator (102-379)
🪛 YAMLlint (1.37.1)
.github/workflows/docker-publish.yml

[error] 33-33: trailing spaces

(trailing-spaces)


[error] 39-39: trailing spaces

(trailing-spaces)


[error] 42-42: trailing spaces

(trailing-spaces)


[error] 45-45: trailing spaces

(trailing-spaces)


[error] 59-59: trailing spaces

(trailing-spaces)


[error] 63-63: trailing spaces

(trailing-spaces)


[error] 70-70: trailing spaces

(trailing-spaces)


[error] 83-83: trailing spaces

(trailing-spaces)


[error] 94-94: trailing spaces

(trailing-spaces)


[error] 97-97: trailing spaces

(trailing-spaces)


[error] 100-100: trailing spaces

(trailing-spaces)


[error] 117-117: trailing spaces

(trailing-spaces)


[error] 124-124: trailing spaces

(trailing-spaces)


[error] 130-130: no new line character at the end of file

(new-line-at-end-of-file)

docker/docker-compose.validator.yml

[error] 16-16: trailing spaces

(trailing-spaces)


[error] 21-21: trailing spaces

(trailing-spaces)


[error] 31-31: trailing spaces

(trailing-spaces)


[error] 41-41: trailing spaces

(trailing-spaces)


[error] 46-46: trailing spaces

(trailing-spaces)


[error] 52-52: trailing spaces

(trailing-spaces)


[error] 58-58: trailing spaces

(trailing-spaces)


[error] 64-64: trailing spaces

(trailing-spaces)


[error] 72-72: trailing spaces

(trailing-spaces)


[error] 75-75: trailing spaces

(trailing-spaces)


[error] 83-83: trailing spaces

(trailing-spaces)


[error] 109-109: no new line character at the end of file

(new-line-at-end-of-file)

.github/workflows/ci.yml

[error] 15-15: trailing spaces

(trailing-spaces)


[error] 21-21: trailing spaces

(trailing-spaces)


[error] 24-24: trailing spaces

(trailing-spaces)


[error] 27-27: trailing spaces

(trailing-spaces)


[error] 30-30: trailing spaces

(trailing-spaces)


[error] 42-42: trailing spaces

(trailing-spaces)


[error] 48-48: trailing spaces

(trailing-spaces)


[error] 51-51: trailing spaces

(trailing-spaces)


[error] 54-54: trailing spaces

(trailing-spaces)


[error] 58-58: no new line character at the end of file

(new-line-at-end-of-file)

🪛 dotenv-linter (3.3.0)
docker/.env.validator.example

[warning] 10-10: [UnorderedKey] The WALLET_HOTKEY key should go before the WALLET_NAME key

(UnorderedKey)


[warning] 15-15: [UnorderedKey] The R2_BUCKET_ID key should go before the R2_ENDPOINT_URL key

(UnorderedKey)


[warning] 16-16: [UnorderedKey] The R2_ACCOUNT_ID key should go before the R2_BUCKET_ID key

(UnorderedKey)


[warning] 19-19: [UnorderedKey] The R2_READ_ACCESS_KEY_ID key should go before the R2_WRITE_ACCESS_KEY_ID key

(UnorderedKey)


[warning] 20-20: [UnorderedKey] The R2_READ_SECRET_ACCESS_KEY key should go before the R2_WRITE_ACCESS_KEY_ID key

(UnorderedKey)


[warning] 53-53: [EndingBlankLine] No blank line at the end of the file

(EndingBlankLine)

🪛 markdownlint-cli2 (0.17.2)
docs/VALIDATOR_DEPLOYMENT.md

230-230: Bare URL used

(MD034, no-bare-urls)


231-231: Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (116)
scripts/test-s3-credentials.py (2)

5-5: Formatting improvement with blank line

Adding a blank line after the imports section improves code readability and follows Python style conventions.


21-21: Code formatting improvements maintain functionality

The formatting changes (single-line environment variable access, consolidated print statements, and keyword argument formatting) improve readability while maintaining identical functionality. These changes align with the broader code hygiene improvements in this PR.

Also applies to: 37-37, 53-53, 61-61, 76-76, 79-79, 104-104

grail/environments/base.py (3)

4-4: LGTM! Type hint modernization is consistent.

The update from List and Tuple to built-in generics (list, tuple) aligns with Python 3.9+ capabilities and the broader PR modernization effort.


30-33: Type hints correctly updated to built-in generics.

The parameter type annotations properly use list[Callable[[Any, Any], float]], list[float], and Optional[list[tuple[float, float]]] following PEP 585.


77-77: Return type annotations consistently modernized.

Both compute_individual_rewards and reward_bounds methods now use built-in generics for their return types (list[float] and tuple[float, float] respectively).

Also applies to: 104-104

.github/workflows/docker-publish.yml (1)

30-30: Verify the condition logic for skipping tests.

The condition if: github.event.inputs.skip_tests != 'true' works correctly for workflow dispatch events. However, for release events where github.event.inputs is undefined, this will evaluate to true and run tests (which is the desired behavior).

grail/monitoring/config.py (1)

23-23: Type annotations successfully modernized to built-in generics.

All method signatures have been updated to use dict[str, Any] and str | None consistently, following PEP 585 and PEP 604 standards.

Also applies to: 56-56, 87-87, 118-118, 171-171, 193-193

scripts/run_integration_experiment.py (4)

25-34: Import organization improved.

The imports are now better organized with standard library imports (argparse, logging, subprocess, sys) properly grouped.


48-56: Type annotations modernized to built-in generics.

Instance variables and method return types correctly updated to use dict, list, and union syntax (str | None).

Also applies to: 62-62


270-272: Parameter type hints properly updated.

The run method signature now uses built-in generics: list[str] for miner_models and Optional[list[str]] for miner_hotkeys.


402-402: Help text improvements are clearer.

The updated help strings for --hotkeys and --validator-delay are more concise and properly formatted.

Also applies to: 410-410

grail/drand.py (3)

12-20: Import cleanup and type modernization complete.

Successfully removed unused typing imports (Dict, Optional, Tuple) and retained only Any as needed. Added from __future__ import annotations for forward compatibility.


55-55: Module-level type annotations updated consistently.

Global variables and cache now use built-in generics: dict[str, dict[str, Any]] for DRAND_CHAINS, str | None for chain parameters, and proper dict typing for the cache.

Also applies to: 75-77, 80-80


95-95: All function signatures modernized to built-in generics.

Public and private function signatures consistently updated to use dict[str, Any], tuple[...], and int | None throughout the module.

Also applies to: 101-103, 130-130, 156-156, 249-249, 266-266, 320-320, 345-345, 382-382

docker/docker-compose.validator.yml (1)

23-29: R2 credential validation present — no action required

Validator loads/checks dual and single R2 credential sets and logs an explicit, actionable error when missing. See grail/infrastructure/credentials.py (checks around lines 55–89; error handling/log message around lines 111–114).

grail/cli/mine.py (3)

15-15: Type imports properly modernized.

Successfully removed List and Tuple from typing imports, keeping only Any and Optional as needed.


84-84: Function signatures consistently updated to built-in generics.

All function return types and parameter types now use tuple[...] and list[...] instead of the typing module equivalents.

Also applies to: 99-99, 252-252, 288-288, 363-363, 370-370, 489-489, 534-534, 558-558, 271-271


335-335: Log message formatting improvements.

The updated log messages with improved formatting (using parentheses for block ranges and consistent structure) enhance readability.

Also applies to: 600-600, 658-658

grail/monitoring/backends/__init__.py (1)

9-10: Import order change is reasonable

The reordering of imports (moving WandBBackend after NullBackend) doesn't affect functionality and maintains consistency with the __all__ declaration order.

tests/test_basics.py (1)

3-8: Verify import path and organization

The new import of MonitoringConfig and the reorganization of constant imports look good. The placement of WINDOW_LENGTH within the constants import is consistent.

grail/monitoring/__init__.py (1)

8-9: Good modernization of imports

The reordering of base imports and addition of MonitoringConfig to the public API improves the module interface. This makes MonitoringConfig properly accessible from grail.monitoring, which is needed by the new test.

grail/environments/__init__.py (1)

7-14: Proper consolidation of SAT-related exports

The addition of SATProblem, SATRolloutGenerator, and create_sat_reward_vector to the imports properly exposes these components through the environments module, making the API more complete and consistent.

grail/infrastructure/drand.py (2)

4-6: Appropriate type hint modernization

The typing imports are correctly updated to use modern Python typing with built-in generics. The addition of os import is necessary for the os.urandom() call on line 144.


20-20: Consistent use of built-in generics

The changes from Dict[str, ChainConfig] to dict[str, ChainConfig] and similar for the return type of get_current_chain() follow Python 3.9+ typing conventions consistently.

Also applies to: 82-82

grail/monitoring/backends/null_backend.py (2)

11-16: Proper typing modernization with correct imports

The addition of Generator from collections.abc and the modernization to use built-in generics (dict, list) is correct and follows Python 3.9+ conventions.


26-26: All method signatures properly updated

The type hints have been consistently updated across all methods to use built-in generics (dict[str, Any], list[MetricData], dict[str, str] | None), maintaining API compatibility while modernizing the typing.

Also applies to: 42-42, 51-51, 73-73

grail/cli/__init__.py (3)

14-14: Good defensive loading of dotenv

Moving load_dotenv to the top and wrapping it in a try-except block ensures environment variables are loaded early without risking startup failures if dotenv is unavailable or encounters errors.

Also applies to: 24-28


92-133: Robust monitoring initialization with proper error handling

The new _initialize_monitoring function provides excellent separation of concerns and proper error handling:

  • Checks if monitoring is enabled before initialization
  • Adjusts configuration based on verbosity level
  • Validates configuration and falls back to null backend on errors
  • Ensures monitoring failures don't crash the CLI

This is a well-structured implementation that follows defensive programming principles.


183-191: Clean type annotation modernization

The updated type annotation using the union operator (Callable[[typer.Typer], None] | None) is cleaner and more readable than the Optional wrapper.

grail/__init__.py (2)

18-20: LGTM! The imports are correctly organized.

The two imports create_sat_reward_vector and generate_sat_problem have been properly added from the environments module, maintaining consistency with the public API exports declared in __all__.


21-43: LGTM! Module imports are well organized.

The reorganization of imports from grail, infrastructure.comms, infrastructure.drand, and mining.rollout_generator modules is clean and follows a logical structure. The imports align with the module's public API as specified in __all__.

grail/monitoring/base.py (3)

12-16: LGTM! Import modernization looks good.

The migration from typing module imports to using collections.abc.Generator and built-in generics aligns with PEP 585 recommendations for Python 3.9+.


35-38: LGTM! Type annotations modernized correctly.

The update from Optional[Dict[str, str]] to dict[str, str] | None follows Python 3.10+ conventions for union types using the pipe operator.


99-100: @contextmanager decorator removed but method still yields

The timer method signature has been updated to use built-in generics, but the @contextmanager decorator on line 99 has been removed. This decorator is essential for the method to function correctly as a context manager when used with with statements.

Apply this fix to restore the decorator:

     @abstractmethod
-    @contextmanager
+    @contextmanager
     def timer(self, name: str, tags: dict[str, str] | None = None) -> Generator[None, None, None]:

Likely an incorrect or invalid review comment.

grail/infrastructure/comms.py (3)

120-129: LGTM! Type annotation modernization.

The update from Dict[str, Any] to dict[str, Any] for s3_config follows the modern Python typing conventions.


220-221: LGTM! Improved formatting of log messages.

The consolidation of multi-line f-strings into single lines improves readability while maintaining the same functionality.

Also applies to: 270-271


339-339: LGTM! Consistent type annotation updates across function signatures.

All function signatures have been properly updated to use built-in generics (list[dict], dict[str, Any]) instead of typing module imports. This modernization is consistent throughout the file.

Also applies to: 554-554, 572-572, 597-597, 728-728, 758-758, 824-824, 1000-1000

grail/grail.py (4)

167-170: LGTM! Type annotation improvements for cache handling.

The type annotations for the cache have been properly updated to use built-in generics dict[tuple[str, int], torch.Tensor], and the cache update logic is correct.

Also applies to: 208-209


261-263: LGTM! Improved exception chaining.

The addition of from e properly chains the exception, preserving the original error context for better debugging.


650-651: LGTM! Log message formatting improvement.

The consolidation of the log message into a single f-string improves readability.


577-577: LGTM! Consistent type annotation updates.

All method signatures have been properly updated to use built-in generics (tuple[bool, dict[str, bool]], dict[str, bool], list[int]) following modern Python conventions.

Also applies to: 596-597, 1130-1130

grail/mining/rollout_generator.py (4)

6-6: LGTM! Import organization improvements.

The imports have been properly reorganized with typing.Any for type hints and torch import moved to its appropriate location.

Also applies to: 10-10


81-82: LGTM! Type annotations modernized in GRPORollout dataclass.

All type annotations have been properly updated to use built-in generics (list[int], list[float], list[tuple[Any, Any, float]], dict) instead of typing module imports.

Also applies to: 93-93, 97-97, 99-99


254-254: LGTM! Log message formatting improvement.

The warning message has been consolidated into a single line for better readability.


143-143: LGTM! Consistent type annotation updates across all methods.

All method signatures have been properly updated to use built-in generics (list[GRPORollout], list[torch.Tensor], list[int], list[float], dict) following modern Python conventions.

Also applies to: 300-300, 314-314, 339-339, 354-354, 367-367

grail/monitoring/manager.py (10)

31-31: LGTM: Type modernization applied correctly

The union type syntax modernization from Optional[MonitoringBackend] to MonitoringBackend | None aligns with PEP 604 and Python 3.10+ standards.


38-38: LGTM: Proper type annotation for metric buffer

The buffer is correctly typed as list[MetricData] using built-in generics.


41-46: LGTM: Consistent type modernization for optional attributes

All optional attributes are properly typed using the modern union syntax (X | None).


49-49: LGTM: Correct parameter type annotation

The config parameter is properly typed as dict[str, Any] using built-in generics.


103-103: LGTM: Proper type annotation for window_number parameter

The optional window_number parameter is correctly typed as int | None.


116-117: LGTM: Consistent type modernization for numeric parameters

The value and tags parameters are properly typed using modern union syntax.


148-149: LGTM: Consistent type annotations for gauge method

Parameters are properly typed using built-in generics and union syntax.


176-176: LGTM: Proper type annotation for histogram method

The tags parameter is correctly typed as dict[str, str] | None.


268-268: LGTM: Proper type annotation for start_run method

The config parameter is correctly typed as optional dict[str, Any] | None.


335-335: LGTM: Consistent type modernization for global variable

The global _monitoring_manager is properly typed as MonitoringManager | None.

grail/monitoring/backends/wandb_backend.py (14)

13-13: LGTM: Proper import of Generator type

The Generator type is correctly imported from collections.abc for proper typing of context managers.


19-19: LGTM: Import reordering for consistency

The imports are properly reorganized with base imports following a logical order.


35-35: LGTM: Type modernization for config attribute

The config attribute is properly typed as dict[str, Any] using built-in generics.


39-41: LGTM: Consistent type modernization for optional attributes

The _start_time and _tables attributes are properly typed using modern syntax.


43-43: LGTM: Proper parameter type annotation

The config parameter is correctly typed as dict[str, Any].


180-180: LGTM: Proper type annotation for metrics list

The metrics parameter is correctly typed as list[MetricData].


197-197: LGTM: Explicit type annotation for data dictionary

The data variable is explicitly typed as dict[str, Any] for clarity.


220-220: LGTM: Consistent return type annotation

The return type is properly typed as dict[str, Any].


260-260: LGTM: Proper type annotation for array_like variable

The array_like variable is correctly typed as np.ndarray | None.


280-280: LGTM: Proper return type annotation

The return type is correctly typed as float | None.


296-296: LGTM: Consistent parameter type annotations

Both data and metric parameters are properly typed using built-in generics.


324-324: LGTM: Proper return type for context manager

The timer method correctly returns Generator[None, None, None] to match the context manager protocol.


371-416: Well-designed text artifact handling

The implementation for handling text artifacts with persistent wandb Tables is well thought out. Using a defined schema with stable columns and MUTABLE mode for accumulation is a good design choice.


427-427: LGTM: Proper parameter type annotation

The config parameter is correctly typed as dict[str, Any].

grail/environments/sat.py (15)

5-7: LGTM: Proper imports for built-in types

The imports are correctly updated to use built-in types directly rather than importing from typing.


27-27: LGTM: Type modernization for SATProblem constructor

The clauses parameter is properly typed as list[list[int]] using built-in generics.


38-38: Minor: Unused loop variable

The loop variable _i is prefixed with underscore to indicate it's unused, which is a good practice.


43-43: LGTM: Proper parameter type annotation

The assignment parameter is correctly typed as list[bool].


120-120: LGTM: Consistent return type annotation

The parse method return type is properly typed as dict[str, Any].


179-179: LGTM: Explicit type annotation for parsed dictionary

The parsed variable is explicitly typed as dict[str, Any] for clarity.


191-191: Minor formatting improvement in logging

The f-string formatting is cleaner than the previous style with explicit string conversion.


201-201: LGTM: Consistent type annotations

Both the parameter and return types are properly typed using built-in generics.


294-294: LGTM: Proper type casting for reward functions

The cast to list[Callable[[Any, Any], float]] is correctly typed using built-in generics.


373-373: LGTM: Consistent return type annotation

The return type is properly typed as dict[str, Any].


385-386: LGTM: Proper parameter type annotations

Both state and trajectory parameters are correctly typed using built-in generics.


397-397: LGTM: Consistent parameter types

The state parameter is properly typed as dict[str, Any].


404-404: Good practice: Direct attribute assignment

Using direct attribute assignment (self._last_completion_text = text) is cleaner than using setattr.


419-420: LGTM: Proper type annotations for step_environment

Both parameters and return type are correctly typed using built-in generics.


488-489: LGTM: Consistent type annotations

The parameters and return type are properly typed using built-in generics.

grail/cli/validate.py (24)

16-16: LGTM: Simplified imports

The import statement is simplified by removing unnecessary typing imports since built-in types are now used directly.


162-162: LGTM: Proper return type annotation

The return type is correctly typed as tuple[Optional[str], Optional[int], Optional[int]].


175-175: LGTM: Consistent return type annotation

The return type is properly typed as tuple[Optional[str], Optional[int]].


222-223: LGTM: Type modernization for global variable

The miner_inference_counts global is properly typed as defaultdict[str, list].


303-303: LGTM: Proper return type annotation

The return type is correctly typed as tuple[float, float].


341-343: LGTM: Complex nested type properly annotated

The inference_counts variable is correctly typed as defaultdict[str, defaultdict[int, dict[str, int]]].


357-357: LGTM: Simplified dictionary creation

Using dict(zip(...)) is cleaner than the previous list comprehension approach.


438-441: LGTM: Cleaner f-string formatting

The f-string formatting is more readable than the previous style.


629-629: LGTM: Proper return type annotation

The return type is correctly typed as tuple[Any, GrailChainManager].


689-689: LGTM: Proper return type annotation

The return type is correctly typed as list[str].


692-692: LGTM: Cleaner f-string formatting

The f-string formatting improves readability.


702-704: LGTM: Proper type annotations for function signature

All parameters and return type are correctly typed using built-in generics.


723-726: Verify the weighting logic change

The base_score calculation currently only uses unique_score (coefficient 1.0) while successful_score and estimated_valid_score have zero coefficients. This appears intentional based on the comment at line 725, but please confirm this is the desired behavior for the current release.


737-738: LGTM: Proper type annotations

All parameters and return type are correctly typed using built-in generics.


758-765: LGTM: Complex return type properly annotated

The complex tuple return type is correctly typed with all nested generic types.


835-840: LGTM: Proper type annotations for function parameters

All parameters are correctly typed using built-in generics including defaultdict.


868-874: LGTM: Cleaner f-string formatting

The f-string formatting improves readability in warning messages.


927-927: LGTM: Simplified sorting

The direct use of sorted() with a lambda key function is clean and readable.


937-937: LGTM: Improved formatting precision

Using {SAMPLE_RATE * 100:.0f}% provides cleaner percentage display.


1236-1236: LGTM: Cleaner set display in logging

Using {set(base_seeds)} provides a cleaner display of unique values.


1286-1290: LGTM: Proper type annotations for weights computation

All parameters and return type are correctly typed using built-in generics.


1310-1310: LGTM: Explicit type annotation for empty metrics

The EMPTY_METRICS variable is explicitly typed as dict[str, int] for clarity.


1368-1368: LGTM: Proper type annotation for upload function

The all_valid_rollouts parameter is correctly typed as list[dict].


1383-1383: LGTM: Cleaner f-string formatting

The f-string formatting improves readability in the log message.

grail/shared/hf_compat.py (1)

3-3: Correct HF typing import.

PretrainedConfig casing is correct; import looks good.

grail/infrastructure/network.py (1)

10-10: Typing modernization LGTM.

tuple[...] return annotation change is correct and non-breaking.

grail/cli/train.py (3)

65-68: Type-hint modernization LGTM.

tuple[...] annotations are correct and non-breaking.


80-83: Type-hint modernization LGTM.

Consistent with PEP 585.


271-273: Reward logging OK; ensure non-empty rewards.

Given earlier guards, this is fine.

grail/infrastructure/chain.py (3)

55-57: Built-in generics for commitments.

Annotation update to dict[int, Bucket] is correct and consistent with the repo.


167-176: Return typing modernization LGTM.

Signature and docstring align; function returns dict on all paths.


267-275: get_all_buckets typing/return LGTM.

Consistent use of built-in generics.

- Add GRAIL_BURN_UID and GRAIL_BURN_PERCENTAGE environment variables
- Implement burn logic in weight computation
- When configured, allocates specified percentage of emissions to burn UID
- Scales down all other weights proportionally
- Updates docker-compose and documentation

Example: GRAIL_BURN_UID=0 GRAIL_BURN_PERCENTAGE=10 burns 10% of emissions
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (9)
docker/docker-compose.validator.yml (4)

69-75: Healthcheck may fail: pgrep might be missing and exact match is brittle.

Use CMD-SHELL and a more forgiving pattern; or a grail health subcommand if available. Verify pgrep exists in the image.

-    healthcheck:
-      test: ["CMD", "pgrep", "-f", "grail validate"]
+    healthcheck:
+      test: ["CMD-SHELL", "pgrep -f 'grail.*validate' >/dev/null 2>&1 || exit 1"]
       interval: 60s
       timeout: 10s
       retries: 3
       start_period: 30s

87-92: 30s Watchtower polling is aggressive.

Consider 300–900s to reduce registry load and rate-limit risk. Functionality unchanged.

-    command: --interval 30 --cleanup --label-enable
+    command: --interval 600 --cleanup --label-enable

11-49: Trailing spaces and missing EOF newline.

Strip trailing spaces (see YAMLlint at Lines 16, 21, 29, 39, 44, 51, 57, 65, 68, 76) and add a newline at end of file to keep linters green.

Also applies to: 51-57, 65-68, 76-76, 98-102


52-57: Nice: wallet mount is read-only. Consider extra hardening.

Optional hardening that usually works with CUDA containers: drop caps and set no-new-privileges.

   validator:
@@
     volumes:
       - ${HOME}/.bittensor:/root/.bittensor:ro
       - validator-cache:/root/.cache
       - validator-data:/app/data
+    security_opt:
+      - no-new-privileges:true
+    cap_drop:
+      - ALL
docs/VALIDATOR_DEPLOYMENT.md (5)

65-73: Docs and compose disagree on GPU config.

Compose file used with docker compose needs gpus: all (Swarm-only deploy.devices won’t apply). Add a note and/or snippet so users on docker compose get GPU access.

Also applies to: 138-145


165-166: CLI inconsistency: docker compose vs docker-compose.

Standardize on docker compose (plugin) throughout.

-docker-compose -f docker/docker-compose.validator.yml restart validator
+docker compose -f docker/docker-compose.validator.yml restart validator

171-175: ps template may not expose .State consistently.

Consider showing only Status, or explicitly document expected output.

-docker ps --format "table {{.Names}}\t{{.Status}}\t{{.State}}"
+docker ps --format "table {{.Names}}\t{{.Status}}"

234-235: Replace bare URLs with Markdown links.

Keeps markdownlint happy.

-- GitHub Issues: https://github.com/tplr-ai/grail/issues
-- Documentation: https://github.com/tplr-ai/grail/blob/main/SPEC.md
+- GitHub Issues: [tplr-ai/grail issues](https://github.com/tplr-ai/grail/issues)
+- Documentation: [SPEC.md](https://github.com/tplr-ai/grail/blob/main/SPEC.md)

198-209: Optional: provide port examples if “external axon” requires exposure.

Section mentions ports but compose doesn’t expose any. Add a snippet for users who need it.

# Example (only if your validator exposes an external service)
services:
  validator:
    ports:
      - "8080:8080"  # replace with actual health/axon port
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d0ad846 and b212fad.

📒 Files selected for processing (3)
  • docker/README.md (1 hunks)
  • docker/docker-compose.validator.yml (1 hunks)
  • docs/VALIDATOR_DEPLOYMENT.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • docker/README.md
🧰 Additional context used
🪛 YAMLlint (1.37.1)
docker/docker-compose.validator.yml

[error] 16-16: trailing spaces

(trailing-spaces)


[error] 21-21: trailing spaces

(trailing-spaces)


[error] 29-29: trailing spaces

(trailing-spaces)


[error] 39-39: trailing spaces

(trailing-spaces)


[error] 44-44: trailing spaces

(trailing-spaces)


[error] 51-51: trailing spaces

(trailing-spaces)


[error] 57-57: trailing spaces

(trailing-spaces)


[error] 65-65: trailing spaces

(trailing-spaces)


[error] 68-68: trailing spaces

(trailing-spaces)


[error] 76-76: trailing spaces

(trailing-spaces)


[error] 102-102: no new line character at the end of file

(new-line-at-end-of-file)

🪛 markdownlint-cli2 (0.17.2)
docs/VALIDATOR_DEPLOYMENT.md

234-234: Bare URL used

(MD034, no-bare-urls)


235-235: Bare URL used

(MD034, no-bare-urls)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run Tests (3.11)

Comment on lines +58 to +65
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

GPU reservation won’t be honored by docker compose; add gpus: all (or pin to Swarm).

The deploy.resources.devices block is only effective with Swarm (docker stack). Your docs use docker compose, so the container likely won’t get a GPU. Prefer the compose-native gpus field; keep deploy only if you also support Swarm.

Apply:

   validator:
     image: ghcr.io/tplr-ai/grail:latest
+    gpus: "all"
@@
-    deploy:
-      resources:
-        reservations:
-          devices:
-            - driver: nvidia
-              count: 1
-              capabilities: [gpu]
+    # Optional: keep the Swarm block if you also deploy via `docker stack`
+    # deploy:
+    #   resources:
+    #     reservations:
+    #       devices:
+    #         - driver: nvidia
+    #           count: 1
+    #           capabilities: [gpu]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
validator:
image: ghcr.io/tplr-ai/grail:latest
gpus: "all"
# Optional: keep the Swarm block if you also deploy via `docker stack`
# deploy:
# resources:
# reservations:
# devices:
# - driver: nvidia
# count: 1
# capabilities: [gpu]
🧰 Tools
🪛 YAMLlint (1.37.1)

[error] 65-65: trailing spaces

(trailing-spaces)

🤖 Prompt for AI Agents
In docker/docker-compose.validator.yml around lines 58 to 65, the compose file
uses deploy.resources.devices which only works with Swarm and will not grant a
GPU when using docker compose; replace or complement that block with the
compose-native gpus setting (e.g., add gpus: all or gpus: '"device=0"' under the
service definition) so the GPU is available in docker compose, and keep
deploy.resources.devices only if you also support Swarm deployments.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
grail/__init__.py (2)

23-35: Public API: imports not reflected in all (PROTOCOL_VERSION, RolloutGenerator).

You import these at module scope but don’t list them in all. If they’re intended as public symbols, add them; otherwise drop the imports to avoid accidental exposure.

Apply to include both:

 __all__ = [
   # Core classes
   "Prover",
   "Verifier",
+  "PROTOCOL_VERSION",
   # New reward system
   "Parser",
   "RewardVector",
   "SATParser",
   "create_sat_reward_vector",
   # Existing SAT classes
   "SATProblem",
   "generate_sat_problem",
   "SATRolloutGenerator",
+  "RolloutGenerator",
   # Entry points
   "main",
 ]

Alternative: if they’re not public, remove the imports:

-from .infrastructure.comms import (
-    PROTOCOL_VERSION,
-    ...
-)
...
-from .mining.rollout_generator import RolloutGenerator  # noqa: F401, E402, E501, F403, F405
+from .infrastructure.comms import (... without PROTOCOL_VERSION ...)
+# (and drop the RolloutGenerator import)

Also applies to: 43-43, 47-62


10-20: Nit: narrow noqa disables.

You’re not using import-star, so F403/F405 are unnecessary. Prefer minimal, targeted noqa to keep lints meaningful.

For example:

-from .environments import (  # noqa: F401, E402, E501, F403, F405
+from .environments import (  # noqa: F401, E402, E501
...
-from .grail import Prover, Verifier  # noqa: F401, E402, E501, F403, F405
+from .grail import Prover, Verifier  # noqa: F401, E402, E501
...
-from .infrastructure.comms import (  # noqa: F401, E402, E501, F403, F405
+from .infrastructure.comms import (  # noqa: F401, E402, E501
...
-from .infrastructure.drand import (  # noqa: F401, E402, E501, F403, F405
+from .infrastructure.drand import (  # noqa: F401, E402, E501
...
-from .mining.rollout_generator import RolloutGenerator  # noqa: F401, E402, E501, F403, F405
+from .mining.rollout_generator import RolloutGenerator  # noqa: F401, E402, E501

Also applies to: 21-21, 23-35, 39-42

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b212fad and 3d20da9.

📒 Files selected for processing (1)
  • grail/__init__.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
grail/__init__.py (6)
grail/environments/sat.py (5)
  • SATParser (96-198)
  • SATProblem (24-60)
  • SATRolloutGenerator (337-523)
  • create_sat_reward_vector (281-334)
  • generate_sat_problem (63-93)
grail/grail.py (2)
  • Prover (505-541)
  • Verifier (547-1188)
grail/infrastructure/comms.py (11)
  • download_file_chunked (371-483)
  • download_from_huggingface (996-1051)
  • file_exists (525-547)
  • get_file (568-586)
  • get_valid_rollouts (754-785)
  • list_bucket_files (550-565)
  • login_huggingface (793-820)
  • sink_window_inferences (594-622)
  • upload_file_chunked (204-297)
  • upload_to_huggingface (823-993)
  • upload_valid_rollouts (726-751)
grail/drand.py (2)
  • get_drand_beacon (266-317)
  • get_round_at_time (364-379)
grail/infrastructure/drand.py (2)
  • get_drand_beacon (87-137)
  • get_round_at_time (170-173)
grail/mining/rollout_generator.py (1)
  • RolloutGenerator (102-379)
🔇 Additional comments (3)
grail/__init__.py (3)

18-20: LGTM: exposing SAT helpers in top-level API.

Good call surfacing create_sat_reward_vector and generate_sat_problem here.


21-21: LGTM: Prover/Verifier re-export.

This makes the common entry points easy to discover.


2-2: Confirm version reset; prefer forward-only bump per PEP 440.

pyproject.toml uses hatch to read version from grail/init.py; version is currently "0.0.1" (grail/init.py:2) and there are no git tags — don't decrease the package version; bump to the next dev release.

-__version__ = "0.0.1"
+__version__ = "0.1.0.dev1"

Comment on lines +39 to +42
from .infrastructure.drand import (
get_drand_beacon,
get_round_at_time,
) # noqa: F401, E402, E501, F403, F405
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Unify DRAND API source to avoid inconsistent semantics.

Both grail/drand.py and grail/infrastructure/drand.py define get_drand_beacon/get_round_at_time with different edge-case handling (e.g., timestamp < genesis). Pick one canonical module for the public surface and ensure behavior is consistent.

Apply if grail/drand.py is the canonical API:

-from .infrastructure.drand import (
-    get_drand_beacon,
-    get_round_at_time,
-)  # noqa: F401, E402, E501, F403, F405
+from .drand import get_drand_beacon, get_round_at_time  # noqa: F401, E402

Quick repo check for duplicate definitions and differing logic:


🏁 Script executed:

#!/bin/bash
rg -nC2 -g 'grail/**/drand*.py' -e 'def get_drand_beacon\(|def get_round_at_time\('
rg -nC3 -g 'grail/**/drand*.py' -e 'genesis|GENESIS|period|PERIOD|timestamp'

Length of output: 13865


Unify DRAND public API to grail.drand and fix inconsistent round calculation.

  • Replace public export in grail/init.py with the canonical module:
-from .infrastructure.drand import (
-    get_drand_beacon,
-    get_round_at_time,
-)  # noqa: F401, E402, E501, F403, F405
+from .drand import get_drand_beacon, get_round_at_time  # noqa: F401, E402
  • Fix grail/infrastructure/drand.py::get_round_at_time to match canonical semantics (t == genesis -> 1; t < genesis -> 0). Current implementation can produce negative rounds. Example replacement:
def get_round_at_time(timestamp: int) -> int:
    if timestamp < DRAND_GENESIS_TIME:
        return 0
    return 1 + (timestamp - DRAND_GENESIS_TIME) // DRAND_PERIOD
🤖 Prompt for AI Agents
In grail/__init__.py around lines 39 to 42, the package currently exports DRAND
symbols from grail.infrastructure.drand which creates an inconsistent public
API; change the public export to re-export the canonical grail.drand module
(i.e. expose grail.drand as the public DRAND API), and in
grail/infrastructure/drand.py update get_round_at_time so it never returns
negative rounds by returning 0 when timestamp < DRAND_GENESIS_TIME and otherwise
computing 1 + (timestamp - DRAND_GENESIS_TIME) // DRAND_PERIOD to ensure t ==
genesis yields round 1.

erfanMhi and others added 6 commits September 12, 2025 15:45
- Add GRAIL_BURN_UID and GRAIL_BURN_PERCENTAGE environment variables
- Implement burn logic in weight computation
- When configured, allocates specified percentage of emissions to burn UID
- Scales down all other weights proportionally
- Updates docker-compose and documentation

Example: GRAIL_BURN_UID=0 GRAIL_BURN_PERCENTAGE=10 burns 10% of emissions
feat: add emission burn mechanism for validators
@erfanMhi erfanMhi merged commit d1d3ed1 into main Sep 12, 2025
8 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants