Add a generic ViT model to BioNeMo Recipes. #1103
Conversation
Walkthrough

Adds a complete ViT recipe under recipes/vit: model implementation, Hydra configs (including TransformerEngine variants), distributed initializer, training/inference scripts, DCP↔Torch checkpoint utilities with auto-resume, dataset loader, tests, Dockerfile, per-recipe lint/.gitignore, and license additions.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Hydra as Hydra (configs)
    participant Train as train.py
    participant Dist as initialize_distributed
    participant Model as build_vit_model
    participant Data as BeansDataset/Sampler
    participant FSDP as Megatron‑FSDP
    participant Ckpt as checkpoint.py
    User->>Hydra: run training config
    Hydra-->>Train: cfg
    Train->>Dist: init(dp_outer, dp_shard, cp, tp)
    Dist-->>Train: DeviceMesh
    Train->>Model: build_vit_model(cfg, DeviceMesh)
    Model-->>Train: ViT model
    Train->>FSDP: wrap(model, cfg.fsdp)
    Train->>Data: create loaders
    Train->>Ckpt: load_auto_resume_checkpoint(cfg, model, optim)
    alt resumed
        Ckpt-->>Train: resumed step_idx
    else fresh
        Ckpt-->>Train: None
    end
    loop training
        Train->>FSDP: forward/backward/step
        Train->>Ckpt: save_auto_resumable_checkpoint(...)
    end
```
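The auto-resume branch in the diagram (load a checkpoint and get back a resumed step index, or None for a fresh run) can be sketched in plain Python. The `find_latest_checkpoint` helper and the `step_<n>` directory naming below are hypothetical stdlib-only illustrations, not the recipe's actual checkpoint.py API.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(ckpt_root: str) -> Optional[int]:
    """Return the highest step index among step_<n> checkpoint dirs, or None.

    Mirrors the alt resumed / fresh branch above: None means "start from
    scratch"; an integer means "resume from that step".
    """
    root = Path(ckpt_root)
    if not root.is_dir():
        return None  # fresh run: no checkpoint directory yet
    steps = []
    for child in root.iterdir():
        if child.is_dir() and child.name.startswith("step_"):
            suffix = child.name.removeprefix("step_")
            if suffix.isdigit():
                steps.append(int(suffix))
    return max(steps) if steps else None
```

In a real trainer, the returned step would then seed the training loop's starting iteration before the forward/backward/step loop begins.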
```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Hydra as Hydra (configs)
    participant Infer as infer.py
    participant Dist as initialize_distributed
    participant Model as build_vit_model
    participant Ckpt as load_torch_checkpoint
    User->>Hydra: run inference config
    Hydra-->>Infer: cfg
    Infer->>Dist: init(...)
    Dist-->>Infer: DeviceMesh
    Infer->>Model: build_vit_model(cfg, DeviceMesh)
    Infer->>Ckpt: load_torch_checkpoint(path, model, megatron_fsdp=cfg.inference.checkpoint.megatron_fsdp)
    Ckpt-->>Infer: model weights loaded
    Infer->>Model: forward(mock input)
    Model-->>Infer: logits
```
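The `megatron_fsdp` flag passed to load_torch_checkpoint implies two on-disk formats: a sharded DCP checkpoint (a directory) versus a single torch checkpoint file. A hypothetical stdlib-only dispatch sketch, not the recipe's actual loader:

```python
from pathlib import Path


def describe_checkpoint(path: str, megatron_fsdp: bool) -> str:
    """Classify how a checkpoint at `path` would be loaded (illustrative only).

    Distributed (DCP) checkpoints are directories of shards; plain torch
    checkpoints are single files. A real loader would call the matching
    load routine instead of returning a label.
    """
    p = Path(path)
    if megatron_fsdp:
        if not p.is_dir():
            raise ValueError(f"expected a DCP shard directory, got {path}")
        return "dcp-directory"
    if not p.is_file():
        raise ValueError(f"expected a single checkpoint file, got {path}")
    return "torch-file"
```

Validating the path shape up front like this gives a clearer error than letting the underlying loader fail on a mismatched format.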
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 14
🧹 Nitpick comments (21)
recipes/vit/.ruff.toml (2)
3-3: Per-file ignore pattern likely too broad/non-matching
Key "tokenizer_auto" may not match any file. Use a glob with extension.
```diff
-[lint]
-per-file-ignores = { "tokenizer_auto" = ["ALL"] }
+[lint]
+per-file-ignores = { "**/tokenizer_auto.py" = ["ALL"] }
```
4-4: Avoid blanket disabling of core lint rules (F/E).
Disabling F (pyflakes) and E (pycodestyle) hides real bugs (e.g., undefined names). Narrow the ignore set and rely on the repo root for global policy.
```diff
-ignore = ["RUF","D","N","E","PLW","PERF","C","F"]
+ignore = ["RUF","D","N","PLW","PERF","C"]
```

If you need to suppress specific errors in this recipe, prefer targeted per-file ignores.
recipes/vit/.gitignore (1)
1-7: Add common ML/artifact ignores to keep the recipe tree clean.
Optional additions that frequently show up during runs.
```diff
 *.tar*
 *.zip
 checkpoints/
 wandb/
 outputs/
 __pycache__/
 .ruff_cache/
+lightning_logs/
+.ipynb_checkpoints/
+events.out.tfevents*
+*.ckpt
+*.pt
+*.pth
+nvlogs/
+.cache/
```

recipes/vit/data/super-tiny-imagenet-5/train/n03404251/n03404251_boxes.txt (1)
1-20: Confirm schema and coordinate convention in README.
File appears to be TSV: filename, xmin, ymin, xmax, ymax within [0,63]. Add a short README in this folder stating schema, range, and whether boxes are inclusive/exclusive to avoid off-by-one issues downstream.
I can draft the README if you confirm the convention (e.g., COCO-style with xmax/ymax exclusive).
recipes/vit/data/super-tiny-imagenet-5/train/n09193705/n09193705_boxes.txt (1)
1-1: Prefer TSV consistently (single delimiter).
Ensure a single TAB delimiter across all lines/files; mixed spaces/tabs will break simple parsers.
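A strict parser makes both points above concrete: it rejects anything other than a single TAB delimiter and checks the [0, 63] coordinate range. This is an illustrative stdlib sketch assuming the TSV schema described in the comment (filename, xmin, ymin, xmax, ymax); `parse_boxes` and its signature are hypothetical, not part of the recipe.

```python
import csv
import io


def parse_boxes(tsv_text: str, image_size: int = 64):
    """Parse box lines: filename<TAB>xmin<TAB>ymin<TAB>xmax<TAB>ymax.

    Rejects rows that don't split into exactly 5 tab-separated fields
    (catching mixed space/tab delimiters) and coordinates outside
    [0, image_size - 1], assuming inclusive bounds.
    """
    boxes = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if len(row) != 5:
            raise ValueError(f"expected 5 tab-separated fields, got {row!r}")
        name, *coords = row
        xmin, ymin, xmax, ymax = map(int, coords)
        if not (0 <= xmin <= xmax <= image_size - 1
                and 0 <= ymin <= ymax <= image_size - 1):
            raise ValueError(f"box out of range for {name}: {coords}")
        boxes.append((name, xmin, ymin, xmax, ymax))
    return boxes
```

A space-delimited line collapses into a single field here and fails fast, which is exactly the failure mode the comment warns about.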
recipes/vit/Dockerfile (2)
3-6: Harden pip install and improve build portability.
- Use python -m pip and --no-cache-dir.
- Avoid relying solely on BuildKit binds for portability; fall back to copying requirements when not using BuildKit.
Apply:
```diff
-RUN --mount=type=secret,id=netrc,target=/root/.netrc \
-    --mount=type=cache,target=/root/.cache/pip \
-    --mount=type=bind,source=requirements.txt,target=/requirements.txt \
-    PIP_CONSTRAINT= pip install -r /requirements.txt
+RUN --mount=type=secret,id=netrc,target=/root/.netrc \
+    --mount=type=cache,target=/root/.cache/pip \
+    --mount=type=bind,source=requirements.txt,target=/requirements.txt \
+    PIP_CONSTRAINT= python -m pip install --no-cache-dir -r /requirements.txt
```

Optional portability fallback (outside BuildKit): copy requirements first to leverage layer cache, then install.
8-9: Reduce image size and avoid copying training data by default.
COPY . . will include the dataset into the image. Prefer a .dockerignore (or COPY only needed subdirs) and mount data at runtime.
Add recipes/vit/.dockerignore:
```
# Python/build
__pycache__/
*.pyc
.ruff_cache/
# VCS
.git
.gitignore
# Data and outputs
data/
outputs/
wandb/
checkpoints/
ckpts/
```

recipes/vit/README.md (7)
3-3: Avoid bare URL; use a Markdown link.
Conform to MD034 and improve readability.
```diff
-_Adapted ViT model code from huggingface/pytorch-image-models (TImM), which you can check out here: https://github.com/huggingface/pytorch-image-models_
+_Adapted ViT model code from [huggingface/pytorch-image-models (TIMM)](https://github.com/huggingface/pytorch-image-models)._
```
5-11: Fix heading level increment (H2 after H1).
Matches MD001.
```diff
-### Pre-Requisites
+## Pre-requisites
```
11-14: Annotate code fences with language.
Satisfy MD040.
````diff
-```
+```bash
 docker build -t <image_repo>:<image_tag> .
````

17-21: Annotate code fences with language.
Satisfy MD040.

````diff
-```
+```bash
 # Utilize plenty of shared memory (--shm-size) to support loading large batches of image data!
 docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
````

26-31: Annotate code fences and tighten guidance about PIP_CONSTRAINT.
Add bash and clarify optional env var usage.

````diff
-```
+```bash
 cd recipes/vit
-# If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
+# If dependency resolution fails, you can unset constraints via: PIP_CONSTRAINT= pip install ...
 # We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
 pip install -r requirements.txt
````

37-41: Annotate training command fence and align path/config examples.
Add bash and ensure checkpoint path aligns with YAML defaults.

````diff
-```
+```bash
 torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
````

If the TE config uses "./checkpoints/vit_te", either:
- change README example to match, or
- update vit_te_base_patch16_224.yaml to "./ckpts/vit" for consistency.

45-47: Fix broken/misleading link text.
The link text says imagenet_*.py but points to imagenet_dataset.py. Make both consistent.

```diff
-The TIMM-derived model code for the ViT can be found in [`vit.py`](vit.py), and data utilities for ImageNet can be found in [`imagenet_*.py`](imagenet_dataset.py).
+The TIMM-derived model code for the ViT is in [`vit.py`](vit.py), and ImageNet data utilities are in [`imagenet_dataset.py`](imagenet_dataset.py).
```

recipes/vit/config/vit_te_base_patch16_224.yaml (2)
9-12: Add trailing newline to satisfy linters/pipeline.
Fixes EOF warning and CI "End-of-File fixer modified the file".
```diff
 path: "./checkpoints/vit_te"
 resume_from_metric: "-" # + = Highest Metric (Score), - = Lowest Metric (Loss)
+
```
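The fix above is what pre-commit's end-of-file-fixer applies automatically; a minimal stdlib sketch of the same repair (the helper name is hypothetical):

```python
from pathlib import Path


def ensure_trailing_newline(path: str) -> bool:
    """Append a final newline if the file lacks one. Returns True if modified."""
    p = Path(path)
    data = p.read_bytes()
    if data and not data.endswith(b"\n"):
        p.write_bytes(data + b"\n")
        return True
    return False
```

Running it twice is idempotent: the second call reports no modification, which is the property CI relies on.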
9-12: Align checkpoint path with README or document divergence.
Either update README example to "./checkpoints/vit_te" when using this config or change this path to "./ckpts/vit" for consistency.
recipes/vit/config/vit_base_patch16_224.yaml (2)
11-24: Pooling/class_token/reg_tokens combo is unusual for a base ViT.
Using global_pool="map" with class_token=true and reg_tokens=8 increases compute and may not be intended for a tiny demo. Consider reverting to token pooling with no reg tokens, or disabling class_token when using MAP.
```diff
-  global_pool: "map"
-  class_token: true
-  reg_tokens: 8
+  global_pool: "token"
+  class_token: true
+  reg_tokens: 0
```

(Or keep MAP and set class_token: false.)
21-21: Init settings diverge from defaults; may impact reproducibility.
If not intentional, match defaults for stable behavior.
```diff
-  init_values: null
+  init_values: null # keep or set per variant
-  weight_init: null
-  init_variance_rescale: true
+  weight_init: "timm"
+  init_variance_rescale: false
```

Also applies to: 36-39
recipes/vit/config/defaults.yaml (3)
95-99: Boolean style is inconsistent (True vs true).
YAML accepts both, but consistency helps avoid parser surprises and diffs.
```diff
-  normalize: True
+  normalize: true
```
100-117: Transforms look reasonable; consider exposing common augmentation knobs at top level if you expect users to tweak them frequently.
Optional; keeps defaults uncluttered but improves discoverability.
1-4: Add a brief header comment with the expected override points (data_root, num_classes, TE toggle).
Improves UX for folks new to Hydra configs.
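One possible shape for that header comment; the override key names below are illustrative guesses based on the comment, not the recipe's actual config schema:

```yaml
# defaults.yaml — common Hydra override points (names illustrative):
#   dataset.data_root=/path/to/data   # ImageNet-style dataset root
#   model.num_classes=1000            # number of target classes
#   model.use_te=true                 # toggle TransformerEngine layers
```

A header like this lets newcomers discover the main knobs without reading every nested config group.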
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (288)
All 288 ignored files are JPEG images under recipes/vit/data/super-tiny-imagenet-5/ (train/<wnid>/images/*.JPEG and val/images/*.JPEG), each excluded by the !**/*.jpeg path filter.
by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1654.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1660.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1671.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1740.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1802.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1855.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1900.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1934.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1998.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_202.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_2026.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_2066.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_217.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_2234.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_2313.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_235.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_2354.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_2423.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_2454.JPEGis excluded by!**/*.jpeg
📒 Files selected for processing (12)
- recipes/vit/.gitignore (1 hunks)
- recipes/vit/.ruff.toml (1 hunks)
- recipes/vit/Dockerfile (1 hunks)
- recipes/vit/README.md (1 hunks)
- recipes/vit/config/defaults.yaml (1 hunks)
- recipes/vit/config/vit_base_patch16_224.yaml (1 hunks)
- recipes/vit/config/vit_te_base_patch16_224.yaml (1 hunks)
- recipes/vit/data/super-tiny-imagenet-5/train/n03404251/n03404251_boxes.txt (1 hunks)
- recipes/vit/data/super-tiny-imagenet-5/train/n03976657/n03976657_boxes.txt (1 hunks)
- recipes/vit/data/super-tiny-imagenet-5/train/n04507155/n04507155_boxes.txt (1 hunks)
- recipes/vit/data/super-tiny-imagenet-5/train/n06596364/n06596364_boxes.txt (1 hunks)
- recipes/vit/data/super-tiny-imagenet-5/train/n09193705/n09193705_boxes.txt (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 YAMLlint (1.37.1)
recipes/vit/config/vit_te_base_patch16_224.yaml
[error] 12-12: no new line character at the end of file
(new-line-at-end-of-file)
🪛 GitHub Actions: BioNeMo Framework CI
recipes/vit/config/vit_te_base_patch16_224.yaml
[error] 1-1: End-of-File fixer modified the file. Re-run pre-commit to apply and commit changes.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (7)
recipes/vit/data/super-tiny-imagenet-5/train/n06596364/n06596364_boxes.txt (1)
1-500: Document bbox schema and inclusive bounds.
The dataset consistently uses (xmin, ymin, xmax, ymax) across all classes, with coordinates ranging from 0 to 63 inclusive for 64 × 64 images. Add a note in the dataset loader or README to state this explicitly. No normalization across files is needed and no degenerate boxes were found.

Likely an incorrect or invalid review comment.
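A minimal stdlib sketch of such a loader-side check, assuming the usual Tiny-ImageNet tab-separated `filename xmin ymin xmax ymax` layout (the helper name is illustrative):

```python
# Minimal sketch: parse one Tiny-ImageNet-style boxes.txt line and validate
# the (xmin, ymin, xmax, ymax) schema with inclusive 0..63 bounds for 64x64 images.

def parse_box_line(line: str, img_size: int = 64) -> tuple[str, int, int, int, int]:
    name, *coords = line.split()
    xmin, ymin, xmax, ymax = map(int, coords)
    # Inclusive pixel-index bounds: 0 <= min <= max <= img_size - 1.
    if not (0 <= xmin <= xmax <= img_size - 1 and 0 <= ymin <= ymax <= img_size - 1):
        raise ValueError(f"degenerate or out-of-range box in {name!r}: {coords}")
    return name, xmin, ymin, xmax, ymax

print(parse_box_line("n06596364_0.JPEG\t0\t14\t63\t49"))
```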
recipes/vit/config/vit_te_base_patch16_224.yaml (1)
1-4: Defaults ordering LGTM. Hydra defaults look correct with `_self_` last to allow overrides from included configs.
recipes/vit/config/vit_base_patch16_224.yaml (2)
64-67: Confirm `resume_from_metric` accepts "-" in your trainer. If the trainer expects null/None or explicit "+"/"-", passing "-" as a string might silently no-op or misroute.
Would you like me to wire a small validation hook that asserts the field at startup?
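One possible shape for that hook, sketched in plain Python; the accepted `[+|-]<metric>` convention and the function name are assumptions, not the trainer's actual contract:

```python
# Hypothetical startup check for training.checkpoint.resume_from_metric.
# Assumption: the trainer accepts None, or a metric name optionally prefixed
# with "+" (maximize) / "-" (minimize); a bare "-" carries no metric name.

def validate_resume_from_metric(value):
    if value is None:
        return None
    if not isinstance(value, str) or value in ("+", "-"):
        raise ValueError(
            f"resume_from_metric must be None or '[+|-]<metric_name>', got {value!r}"
        )
    sign, name = (value[0], value[1:]) if value[0] in "+-" else ("+", value)
    return sign, name

print(validate_resume_from_metric("-val_loss"))  # -> ('-', 'val_loss')
```

Running this once at startup turns a silent misconfiguration into an immediate, readable error.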
51-54: Module path sanity check for FSDP unit modules. Ensure `vit.Block` and `vit.PatchEmbed` resolve to the actual implementation module; mismatches will crash at wrap time.
I can add a startup import check that logs resolved types before FSDP wrapping if helpful.
recipes/vit/config/defaults.yaml (3)
49-61: Good: FSDP strategy values use descriptive strings. Keeps configs readable and avoids magic numbers. Please keep other configs consistent with this format.
7-16: Defaults favor a plain ViT-B; sensible for a baseline. `qk_norm`/`scale_*` disabled here align with timm's ViT-B. Keep variant-specific tweaks in recipe configs.
43-48: Distributed defaults are sane for single-node demos.No action needed.
Actionable comments posted: 0
🧹 Nitpick comments (1)
recipes/vit/.ruff.toml (1)
2-3: Use `extend-ignore` to augment root ignores; avoid overriding them.

Setting `lint.ignore` here replaces the parent's ignores. If you meant to add to them, use `lint.extend-ignore`. Also consider setting a higher McCabe threshold instead of disabling `C901` entirely, and prefer targeted `per-file-ignores` for `D`/`N` to avoid muting useful signals across the whole recipe.

Apply:

```diff
 [lint]
-ignore = ["D", "N", "C901", "PLW2901"]
+extend-ignore = ["D", "N", "PLW2901"]
+
+# Prefer constraining complexity rather than disabling it entirely.
+[lint.mccabe]
+max-complexity = 20
```

Optionally, if only tests or scripts need relaxed rules:

```toml
[lint.per-file-ignores]
"**/tests/**/*.py" = ["D", "N"]
"**/scripts/*.py" = ["D", "N"]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- recipes/vit/.ruff.toml (1 hunks)
- recipes/vit/config/vit_te_base_patch16_224.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- recipes/vit/config/vit_te_base_patch16_224.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (1)
recipes/vit/.ruff.toml (1)
1-1: No update required: `extend = "../.ruff.toml"` correctly extends `recipes/.ruff.toml`; there is no root `.ruff.toml` to reference.

Likely an incorrect or invalid review comment.
Force-pushed from 2f9a215 to c5a719b (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
recipes/vit/config/defaults.yaml (1)
70-72: Fix unsafe default: 100000 classes will instantiate an oversized head by default. Set a sane default (e.g., ImageNet-1k), and override per recipe (e.g., 5 for super-tiny-imagenet-5).
Apply:
```diff
-dataset:
-  num_classes: 100000
+dataset:
+  num_classes: 1000  # override to 5 in vit_*_patch16_224.yaml for super-tiny-imagenet-5
```
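A quick back-of-envelope supports this: assuming a ViT-B embedding dim of 768 (an assumption, matching timm's ViT-B, not stated in the config excerpt), the linear head for 100000 classes alone holds roughly 77M parameters, about 100x the 1k-class head:

```python
# Back-of-envelope for the classifier head size (assumes ViT-B embed_dim=768).
embed_dim = 768

def head_params(num_classes: int) -> int:
    return embed_dim * num_classes + num_classes  # weight matrix + bias vector

print(head_params(100_000))  # 76,900,000 params
print(head_params(1_000))    # 769,000 params
```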
🧹 Nitpick comments (6)
recipes/vit/config/defaults.yaml (6)
78-79: Enable shuffling for training. Non-shuffled training hurts convergence, even for tiny demos.

```diff
-  batch_size: 1
-  shuffle: false
+  batch_size: 1
+  shuffle: true
```
81-81: De-duplicate config by referencing model image/patch sizes. Prevents drift when changing `img_size`/`patch_size`.

```diff
-  img_size: 224
+  img_size: ${model.vit.img_size}
@@
-  patch_size: 16
+  patch_size: ${model.vit.patch_size}
@@
-  img_size: 224
+  img_size: ${model.vit.img_size}
@@
-  patch_size: 16
+  patch_size: ${model.vit.patch_size}
```

Also applies to: 98-99, 108-108, 116-117
52-56: Avoid over-fragmenting FSDP units; and verify class paths. Wrapping every `torch.nn.Linear`/`LayerNorm` can create thousands of FSDP units and tank perf. Prefer wrapping larger blocks; also ensure the `vit.*` symbols match actual classes.

```diff
 fsdp_unit_modules:
   - vit.Block
   - vit.PatchEmbed
-  - torch.nn.LayerNorm
-  - torch.nn.Linear
```

Would you like me to check the module paths across the repo to confirm they resolve?
66-68: Use a non-null sentinel for `resume_from_metric`. If the loader treats null specially, a sentinel like "-" is safer.

```diff
 checkpoint:
   path: null
-  resume_from_metric: null
+  resume_from_metric: "-"
```
72-72: Make `num_workers` configurable via env with a sensible default. Improves I/O without hardcoding per machine.

```diff
-  num_workers: 0
+  num_workers: ${oc.env:NUM_WORKERS,4}
```

If Hydra env interpolation isn't available here, keep 0 and document overriding it per environment.
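If Hydra's `${oc.env:...}` interpolation can't be used, a plain-Python fallback at DataLoader construction could look like this (the `NUM_WORKERS` variable name and helper are assumptions, not part of the recipe):

```python
# Hypothetical fallback: read the worker count from an environment variable,
# defaulting when it is unset, instead of hardcoding num_workers per machine.
import os

def resolve_num_workers(default: int = 4) -> int:
    return int(os.environ.get("NUM_WORKERS", default))

os.environ["NUM_WORKERS"] = "8"
print(resolve_num_workers())  # -> 8
```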
85-96: YAML style consistency (booleans/floats). Standardize booleans to lowercase and floats to explicit decimals for readability.

```diff
-  vflip: 0.
+  vflip: 0.0
@@
-  color_jitter_prob: null
-  grayscale_prob: 0.
-  gaussian_blur_prob: 0.
+  color_jitter_prob: null
+  grayscale_prob: 0.0
+  gaussian_blur_prob: 0.0
@@
-  normalize: True
-  separate: False
+  normalize: true
+  separate: false
@@
-  patchify: False
+  patchify: false
@@
-  normalize: true
-  patch_size: 16
-  patchify: false
+  normalize: true
+  patch_size: ${model.vit.patch_size}
+  patchify: false
```

Also applies to: 115-117
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- recipes/vit/config/defaults.yaml (1 hunks)
- recipes/vit/config/vit_base_patch16_224.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- recipes/vit/config/vit_base_patch16_224.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (1)
recipes/vit/config/defaults.yaml (1)
6-7: Good: derive classifier size from dataset config. Binding `model.vit.num_classes` to `${dataset.num_classes}` avoids head/dataset drift.
Actionable comments posted: 0
🧹 Nitpick comments (10)
recipes/vit/README.md (10)
3-3: Fix TIMM casing and avoid bare URL (MD034). Use a proper Markdown link and correct "TImM" → "timm".

```diff
-_Adapted ViT model code from huggingface/pytorch-image-models (TImM), which you can check out here: https://github.com/huggingface/pytorch-image-models_
+_Adapted ViT model code from [timm (huggingface/pytorch-image-models)](https://github.com/huggingface/pytorch-image-models)._
```
5-7: Fix heading levels (MD001) and minor wording. H3 → H2 for section, H4 → H3 for subsection; "Pre-Requisites" → "Prerequisites".

```diff
-### Pre-Requisites
-
-#### Docker Container
+## Prerequisites
+
+### Docker container
```
22-24: Make heading level consistent (MD001) and casing.

```diff
-#### PIP Install
+### pip install
```

11-13: Add language to fenced code block (MD040).

````diff
-```
+```bash
 docker build -t <image_repo>:<image_tag> .
````

17-20: Add language to fenced code block (MD040).

````diff
-```
+```bash
 # Utilize plenty of shared memory (--shm-size) to support loading large batches of image data!
 docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
````

26-31: Add language to fenced code block (MD040) and tiny clarity note.

````diff
-```
+```bash
 cd recipes/vit
 # If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
 # We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
 pip install -r requirements.txt
````

37-39: Define NGPU, ensure checkpoint dir exists, and add language to code block.

Without NGPU set, the command fails. Also pre-create the output path.

````diff
-```
-torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
-```
+```bash
+# Set GPUs (override by exporting NGPU beforehand)
+export NGPU=${NGPU:-2}
+mkdir -p ./ckpts/vit
+torchrun --nproc-per-node "${NGPU}" train.py --config-name vit_base_patch16_224 distributed.dp_shard="${NGPU}" training.checkpoint.path=./ckpts/vit
+```
````
41-41: Possible broken PyTorch docs URL; verify and update. Likely correct domain is pytorch.org/docs, not docs.pytorch.org/docs.

```diff
-... save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints ...
+... save auto-resumable [Torch DCP](https://pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints ...
```
43-45: Verify relative links and align anchor text. Confirm these files exist under recipes/vit/ and fix the mismatched anchor text vs link.

```diff
-[`train.py`](train.py) is the transparent entrypoint to this script ...
+[`train.py`](./train.py) is the transparent entrypoint to this recipe ...
 ...
-The TIMM-derived model code for the ViT can be found in [`vit.py`](vit.py), and data utilities for ImageNet can be found in [`imagenet_*.py`](imagenet_dataset.py).
+The TIMM-derived model code for the ViT can be found in [`vit.py`](./vit.py), and ImageNet dataset utilities can be found in [`imagenet_dataset.py`](./imagenet_dataset.py).
```
43-43: Minor wording nit. "entrypoint to this script" reads oddly; use "entrypoint script" or "entrypoint to this recipe."

```diff
-[`train.py`](./train.py) is the transparent entrypoint to this script
+[`train.py`](./train.py) is the transparent entrypoint script
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (229)
(super-tiny-imagenet-5 JPEG files under recipes/vit/data/ excluded by `!**/*.jpeg`)
by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_28.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_29.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_3.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_30.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_31.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_32.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_33.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_34.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_35.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_36.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_37.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_38.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_39.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_4.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_40.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_41.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_42.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_43.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_44.JPEGis excluded 
by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_45.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_46.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_47.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_48.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_49.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_5.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_6.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_7.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_8.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_9.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_0.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_1.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_10.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_11.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_12.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_13.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_14.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_15.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_16.JPEGis excluded 
by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_17.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_18.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_19.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_2.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_20.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_21.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_22.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_23.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_24.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_25.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_26.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_27.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_28.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_29.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_3.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_30.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_31.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_32.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_33.JPEGis excluded 
by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_34.JPEGis excluded by!**/*.jpeg
📒 Files selected for processing (11)
- `recipes/vit/.gitignore` (1 hunks)
- `recipes/vit/.ruff.toml` (1 hunks)
- `recipes/vit/Dockerfile` (1 hunks)
- `recipes/vit/README.md` (1 hunks)
- `recipes/vit/config/defaults.yaml` (1 hunks)
- `recipes/vit/config/vit_base_patch16_224.yaml` (1 hunks)
- `recipes/vit/config/vit_te_base_patch16_224.yaml` (1 hunks)
- `recipes/vit/data/super-tiny-imagenet-5/train/n03404251/n03404251_boxes.txt` (1 hunks)
- `recipes/vit/data/super-tiny-imagenet-5/train/n03976657/n03976657_boxes.txt` (1 hunks)
- `recipes/vit/data/super-tiny-imagenet-5/train/n04507155/n04507155_boxes.txt` (1 hunks)
- `recipes/vit/data/super-tiny-imagenet-5/train/n06596364/n06596364_boxes.txt` (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- recipes/vit/.gitignore
🚧 Files skipped from review as they are similar to previous changes (9)
- recipes/vit/data/super-tiny-imagenet-5/train/n06596364/n06596364_boxes.txt
- recipes/vit/data/super-tiny-imagenet-5/train/n03404251/n03404251_boxes.txt
- recipes/vit/.ruff.toml
- recipes/vit/Dockerfile
- recipes/vit/config/vit_te_base_patch16_224.yaml
- recipes/vit/data/super-tiny-imagenet-5/train/n03976657/n03976657_boxes.txt
- recipes/vit/config/vit_base_patch16_224.yaml
- recipes/vit/config/defaults.yaml
- recipes/vit/data/super-tiny-imagenet-5/train/n04507155/n04507155_boxes.txt
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: pre-commit
- GitHub Check: Analyze (rust)
- GitHub Check: Analyze (python)
4408fb5 to d97f7a2 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
recipes/vit/README.md (1)
3-3: Fix attribution formatting, link, and provenance note; also add a matching header in `vit.py`. Use a proper Markdown link (avoids MD034), correct "timm" casing, and point readers to the bundled LICENSE and a provenance header in `vit.py`. This aligns with the prior ask to include copyright/provenance in code.
```diff
-_Adapted ViT model code from huggingface/pytorch-image-models (TImM) written by Ross Wightman (@rwightman / Copyright 2020), which you can check out here: https://github.com/huggingface/pytorch-image-models_
+_Adapted ViT model code from [timm (pytorch-image-models)](https://github.com/huggingface/pytorch-image-models) by Ross Wightman (Copyright 2020). See `recipes/vit/LICENSE` and provenance notes in `vit.py`._
```
🧹 Nitpick comments (8)
recipes/vit/README.md (8)
5-5: Fix heading levels and wording (MD001)
Heading levels should increment by one. Also standardize wording/casing.

```diff
-### Pre-Requisites
+## Prerequisites
@@
-#### Docker Container
+### Docker container
@@
-#### PIP Install
+### pip install
```

Also applies to: 7-7, 22-22
11-11: Specify languages for fenced blocks (MD040)
Add `bash` language to all shell code fences.

````diff
-```
+```bash
@@
-```
+```bash
@@
-```
+```bash
@@
-```
+```bash
````

Also applies to: 17-17, 26-26, 37-37
19-19: Mount workspace into the container
Without a bind mount and workdir, users won’t see repo files inside the container.

```diff
-docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
+docker run -it --rm --gpus=all --shm-size=16G -v "$(pwd)":/workspace -w /workspace <image_repo>:<image_tag>
```
37-39: Show NGPU definition before torchrun
Make the example self-contained by defining NGPU.

````diff
-```bash
-torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
-```
+```bash
+export NGPU=2 # example
+torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
+```
````
41-41: Fix (likely) broken PyTorch DCP docs URL
“docs.pytorch.org/docs/...” looks wrong; PyTorch docs are usually at pytorch.org/docs/...

```diff
-which will train on a local tiny 5-class version of [ImageNet](https://image-net.org/) ([super-tiny-imagenet-5](./data/super-tiny-imagenet-5/)) and save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
+which will train on a local tiny 5-class version of [ImageNet](https://image-net.org/) ([super-tiny-imagenet-5](./data/super-tiny-imagenet-5/)) and save auto-resumable [Torch DCP](https://pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
```
43-43: Minor wording/typo fixes
“entrypoint” → “entry point”; “de-compressed” → “decompressed”.

```diff
-[`train.py`](train.py) is the transparent entrypoint to this script that explains how to modify your own training loop for `Megatron-FSDP` ([PyPI: `megatron-fsdp`](https://pypi.org/project/megatron-fsdp/) / [Source: Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core/distributed/fsdp/src)) to fully-shard your model across all devices. After executing `train.py` for the first time, the de-compressed ImageNet dataset will be available in `data/super-tiny-imagenet-5/...` (sourced from [`super-tiny-imagenet-5.tar.gz`](./data/super-tiny-imagenet-5.tar.gz)) for experimentation and review.
+[`train.py`](train.py) is the transparent entry point to this recipe that explains how to modify your own training loop for `Megatron-FSDP` ([PyPI: `megatron-fsdp`](https://pypi.org/project/megatron-fsdp/) / [Source: Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core/distributed/fsdp/src)) to fully shard your model across all devices. After executing `train.py` for the first time, the decompressed ImageNet dataset will be available in `data/super-tiny-imagenet-5/...` (sourced from [`super-tiny-imagenet-5.tar.gz`](./data/super-tiny-imagenet-5.tar.gz)) for experimentation and review.
```
45-45: Normalize “timm” casing
Use “timm” (project’s canonical name) for consistency.

```diff
-The TIMM-derived model code for the ViT can be found in [`vit.py`](vit.py), and data utilities for ImageNet can be found in [`imagenet_*.py`](imagenet_dataset.py).
+The timm-derived model code for the ViT can be found in [`vit.py`](vit.py), and data utilities for ImageNet can be found in [`imagenet_*.py`](imagenet_dataset.py).
```
38-39: Optional: add TE config example next to the base config
Make it obvious how to enable TransformerEngine. Example block to append after the base command:
```bash
# TransformerEngine variant
export NGPU=2
torchrun --nproc-per-node ${NGPU} train.py --config-name vit_te_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit_te
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (1)
`recipes/vit/data/super-tiny-imagenet-5.tar.gz` is excluded by `!**/*.gz`
📒 Files selected for processing (3)
- `recipes/vit/.gitignore` (1 hunks)
- `recipes/vit/LICENSE` (1 hunks)
- `recipes/vit/README.md` (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- recipes/vit/LICENSE
🚧 Files skipped from review as they are similar to previous changes (1)
- recipes/vit/.gitignore
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests (recipes/vit, nvcr.io/nvidia/pytorch:25.06-py3)
- GitHub Check: Analyze (rust)
68b2b7a to ba27aa8 (Compare)
Actionable comments posted: 6
♻️ Duplicate comments (1)
recipes/vit/README.md (1)
22-31: Replace incorrect PIP_CONSTRAINT note; recommend venv and fence language. This mirrors a prior comment; use a virtual environment and remove the misleading PIP_CONSTRAINT guidance.
````diff
-#### PIP Install
+### Python environment (pip)
@@
-If you have a virtual environment and CUDA installed, you can install the recipe's dependencies using `pip`:
+If you have CUDA installed, install the recipe’s dependencies in an isolated virtual environment:
@@
-```
-cd recipes/vit
-# If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
-# We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
-pip install -r requirements.txt
-```
+```bash
+cd recipes/vit
+python -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+```
````
🧹 Nitpick comments (11)
recipes/vit/imagenet_dataset.py (1)
304-329: Consider adding a maximum retry limit configuration. The current implementation retries up to 50 consecutive errors, which is hardcoded. Consider making this configurable for different use cases.
You could add a parameter to the constructor:
```diff
 def __init__(
     self,
     root,
     reader=None,
     class_map: Optional[str | dict] = None,
     label_map: Optional[str | dict] = None,
     load_bytes=False,
     input_img_mode="RGB",
     transform=None,
     target_transform=None,
     class_filter: Optional[List[Any]] = None,
+    max_consecutive_errors: int = 50,
     **kwargs,
 ):
```

Then use it in `__getitem__`:

```diff
-        if self._consecutive_errors < 50:
+        if self._consecutive_errors < self.max_consecutive_errors:
```

recipes/vit/vit.py (1)
1357-1402: Add sequence‐length assertion before TransformerEngine blocks
In the model’s forward (e.g. in `VisionTransformer.forward`), assert that `x.shape[1] == embed_len` (where `embed_len = num_patches + num_prefix_tokens` when `no_embed_class=False`, else `num_patches`) before calling `self.blocks`. This will catch any mismatch between the input sequence length and the `seq_length` passed to the TransformerEngine.

recipes/vit/train.py (2)
197-201: Document the sub-mesh flattening behavior. The warning about sub-meshes being deleted by Torch is important. Consider adding more detailed documentation about when and why these references might be deleted.
Add more explicit documentation:
```diff
     # Sub-meshes (possibly) required for Megatron-FSDP.
-    # WARNING: These have a tendency to be deleted by Torch. Save references
-    # or pass them to all classes or functions that use them.
+    # WARNING: PyTorch's DeviceMesh sub-meshes can be garbage collected if no strong
+    # references are kept. These flattened meshes MUST be passed directly to functions
+    # that use them or stored as attributes to prevent deletion. The parent device_mesh
+    # alone does not keep these sub-meshes alive.
```
137-145: Consider using `Path.chmod` with explicit permissions. While the current implementation works, using Path's `chmod` method would be more Pythonic and avoid mixing `os` and `Path` operations.
```diff
-    # Relax checkpoint permissions, which may be helpful when saving checkpoints in a container owned by root.
-    mode = 0o777
-    for dirpath, _, filenames in os.walk(ckpt_dir):
-        # Change current directory perms.
-        os.chmod(dirpath, mode)
-        for filename in filenames:
-            # Change file perms.
-            file_path = Path(dirpath) / filename
-            os.chmod(file_path, mode)
+    # Relax checkpoint permissions, which may be helpful when saving checkpoints in a container owned by root.
+    mode = 0o777
+    for path in ckpt_dir.rglob("*"):
+        path.chmod(mode)
+    ckpt_dir.chmod(mode)
```
3-3: Fix TIMM attribution formatting and avoid bare URL (MD034). Use a proper Markdown link and correct capitalization.
```diff
-_Adapted ViT model code from huggingface/pytorch-image-models (TImM) written by Ross Wightman (@rwightman / Copyright 2020), which you can check out here: https://github.com/huggingface/pytorch-image-models_
+_Adapted ViT model code from **timm** (PyTorch Image Models) by Ross Wightman ([@rwightman]) © 2020 — see [huggingface/pytorch-image-models](https://github.com/huggingface/pytorch-image-models)._
```
5-8: Fix heading levels and spelling (MD001). H2 should follow H1; child sections should increment by one; also “Prerequisites” spelling.
```diff
-### Pre-Requisites
+## Prerequisites
-
-#### Docker Container
+### Docker container
```
11-13: Add language to code fence (MD040).

````diff
-```
+```bash
 docker build -t <image_repo>:<image_tag> .
````

17-20: Add language to code fence (MD040).

````diff
-```
+```bash
 # Utilize plenty of shared memory (--shm-size) to support loading large batches of image data!
 docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
````

35-39: Specify code fence language and show NGPU example to reduce confusion. Add bash fence and an explicit NGPU example (matches your 2‑GPU guidance). Confirm whether you want to highlight the TE config by default (vit_te_base_patch16_224) or keep the base variant here; both exist per configs.

````diff
-```
-torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
-```
+```bash
+# Example: 2 GPUs
+NGPU=2
+torchrun --nproc-per-node ${NGPU} train.py \
+    --config-name vit_base_patch16_224 \
+    distributed.dp_shard=${NGPU} \
+    training.checkpoint.path=./ckpts/vit
+```
````
41-43: Fix PyTorch docs URL and minor wording. Correct docs host, and “decompressed” spelling.
```diff
-which will train on a local tiny 5-class version of [ImageNet](https://image-net.org/) ([super-tiny-imagenet-5](./data/super-tiny-imagenet-5/)) and save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
+which will train on a local tiny 5‑class version of [ImageNet](https://image-net.org/) ([super-tiny-imagenet-5](./data/super-tiny-imagenet-5/)) and save auto‑resumable [Torch Distributed Checkpoint](https://pytorch.org/docs/stable/distributed.checkpoint.html) artifacts to the `training.checkpoint.path` directory.
@@
-[...] After executing `train.py` for the first time, the de-compressed ImageNet dataset will be available in `data/super-tiny-imagenet-5/...` [...]
+[...] After executing `train.py` for the first time, the decompressed ImageNet dataset will be available in `data/super-tiny-imagenet-5/...` [...]
```
45-46: Fix link text/target mismatch; link both dataset utilities explicitly.-The TIMM-derived model code for the ViT can be found in [`vit.py`](vit.py), and data utilities for ImageNet can be found in [`imagenet_*.py`](imagenet_dataset.py). +The TIMM‑derived model code for the ViT is in [`vit.py`](vit.py), and ImageNet data utilities are in [`imagenet_dataset.py`](imagenet_dataset.py) and [`imagenet_utils.py`](imagenet_utils.py).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- `recipes/vit/LICENSE` (1 hunks)
- `recipes/vit/README.md` (1 hunks)
- `recipes/vit/imagenet_dataset.py` (1 hunks)
- `recipes/vit/imagenet_utils.py` (1 hunks)
- `recipes/vit/train.py` (1 hunks)
- `recipes/vit/vit.py` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- recipes/vit/imagenet_utils.py
- recipes/vit/LICENSE
🧰 Additional context used
🧬 Code graph analysis (1)
recipes/vit/train.py (3)
recipes/vit/imagenet_dataset.py (4)
- `ImageNetDataset` (250-337)
- `infinite_dataloader` (38-45)
- `filename` (243-244)
- `filename` (333-334)

recipes/vit/imagenet_utils.py (2)
- `transforms_imagenet_eval` (946-1033)
- `transforms_imagenet_train` (775-943)

recipes/vit/vit.py (1)
- `VisionTransformer` (1218-1657)
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: changed-files
- GitHub Check: Analyze (rust)
🔇 Additional comments (5)
recipes/vit/imagenet_dataset.py (1)
38-45: LGTM! Clean infinite dataloader implementation. The implementation correctly handles epoch updates for proper shuffling in distributed training scenarios.
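For readers new to the pattern, here is a minimal sketch of an epoch-aware infinite loader. It assumes a sampler object exposing `set_epoch` (as `torch.utils.data.DistributedSampler` does); the function name mirrors the reviewed `infinite_dataloader` but the body is an illustration, not the recipe's exact code.

```python
def infinite_dataloader(dataloader, sampler):
    """Yield batches forever, bumping the sampler epoch each full pass.

    Calling set_epoch(epoch) reseeds the distributed shuffle so that
    every pass over the dataset produces a different batch order.
    """
    epoch = 0
    while True:
        if hasattr(sampler, "set_epoch"):
            sampler.set_epoch(epoch)  # reseed the shuffle for this pass
        for batch in dataloader:
            yield batch
        epoch += 1
```

Without the `set_epoch` call, a `DistributedSampler` with `shuffle=True` would replay the same shuffled order every epoch.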
recipes/vit/vit.py (2)
1454-1456: LGTM! Proper meta device initialization support. The `reset_parameters` method correctly delegates to `init_weights()` for Megatron-FSDP meta device initialization compatibility.
95-96: No changes needed for `LayerScale.reset_parameters`: it correctly resets `gamma` (the only parameter) for Megatron-FSDP meta-device initialization.
333-336: LGTM! Important persistent workers configuration. The comment correctly highlights the critical requirement for `persistent_workers=True` when using Megatron-FSDP and Torch DCP to prevent worker conflicts.
47-47: Config path reference looks good.
Actionable comments posted: 0
♻️ Duplicate comments (3)
recipes/vit/train.py (3)
58-100: Checkpoint name parsing is still vulnerable to malformed names. The code assumes checkpoints follow the exact format `step_<step_idx>_loss_<loss_value>` and directly accesses split indices without validation, which can cause `IndexError` for malformed checkpoint names. The previous suggestion to add validation for checkpoint name parsing should still be implemented to prevent crashes from malformed checkpoint directories.
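One hedged way to harden the parsing is sketched below. The helper names and the regex are illustrative (they assume the `step_<step_idx>_loss_<loss_value>` naming convention described above), not the recipe's actual code.

```python
import re
from pathlib import Path
from typing import Optional

# Checkpoint dirs are assumed to be named "step_<step_idx>_loss_<loss_value>".
_CKPT_NAME_RE = re.compile(r"^step_(\d+)_loss_([0-9.]+)$")


def parse_checkpoint_step(ckpt_dir: Path) -> Optional[int]:
    """Return the step index encoded in a checkpoint dir name, or None if malformed."""
    match = _CKPT_NAME_RE.match(ckpt_dir.name)
    if match is None:
        return None  # Skip malformed names instead of raising IndexError.
    return int(match.group(1))


def latest_checkpoint(ckpt_root: Path) -> Optional[Path]:
    """Return the checkpoint dir with the highest step index, ignoring malformed names."""
    valid = [
        (step, child)
        for child in ckpt_root.iterdir()
        if child.is_dir() and (step := parse_checkpoint_step(child)) is not None
    ]
    if not valid:
        return None
    return max(valid, key=lambda pair: pair[0])[1]
```

The key design choice is that malformed directory names are silently skipped rather than crashing auto-resume; a warning log could be added where skipping should be surfaced.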
304-308: Tar extraction lacks error handling. The extraction could fail due to permissions, disk space, or a corrupted archive, and currently would surface only as an opaque traceback.
The previous suggestion to add error handling for tar extraction should still be implemented to prevent silent failures or unclear error messages.
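A minimal sketch of what guarded extraction could look like (the function and path names are illustrative, not the recipe's actual code):

```python
import tarfile
from pathlib import Path


def extract_dataset(archive: Path, dest: Path) -> None:
    """Extract a .tar.gz dataset archive, surfacing a clear error on failure."""
    try:
        with tarfile.open(archive, "r:gz") as tar:
            # On Python 3.12+, consider tar.extractall(dest, filter="data")
            # to reject unsafe archive members.
            tar.extractall(dest)
    except (tarfile.TarError, OSError) as exc:
        raise RuntimeError(f"Failed to extract {archive} to {dest}: {exc}") from exc
```

Wrapping both `tarfile.TarError` and `OSError` covers corrupted archives as well as permission and disk-space failures, and the re-raised `RuntimeError` names the archive and destination for easier debugging.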
387-391: Dataloader resume logic may cause data duplication/skipping. Using `start=latest_step_idx` in `enumerate()` only shifts the batch index counter; it doesn't advance the underlying iterator. This means the dataloader will restart from the beginning of the dataset rather than resuming from the correct position. The dataloader should properly skip the first `latest_step_idx` batches from the infinite iterator to ensure correct resumption without data duplication or skipping.
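The difference is easy to demonstrate with plain iterators: `enumerate(start=...)` merely relabels the counter, while consuming the iterator (e.g. via `itertools.islice`) actually discards the already-trained batches.

```python
from itertools import islice


def infinite_batches():
    """Stand-in for an infinite dataloader yielding batch indices 0, 1, 2, ..."""
    i = 0
    while True:
        yield i
        i += 1


latest_step_idx = 3

# enumerate(start=...) only relabels the counter; the underlying iterator
# still starts at batch 0, so the first batches would be replayed.
wrong_step, wrong_batch = next(enumerate(infinite_batches(), start=latest_step_idx))

# To resume correctly, consume (discard) the already-trained batches first.
loader = infinite_batches()
for _ in islice(loader, latest_step_idx):
    pass  # skip batches 0 .. latest_step_idx - 1
right_step, right_batch = next(enumerate(loader, start=latest_step_idx))
```

Here `wrong_batch` is 0 (replayed data) while `right_batch` is 3, matching the resumed step counter.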
🧹 Nitpick comments (1)
recipes/vit/train.py (1)
116-148: Consider security implications of changing file permissions. Setting all files and directories to 0o777 (world-writable) could create security vulnerabilities in multi-user environments.
Consider using more restrictive permissions or making this configurable:
```diff
-    mode = 0o777
+    # Use more secure permissions - owner and group can write, others can read
+    dir_mode = 0o775
+    file_mode = 0o664
     for dirpath, _, filenames in os.walk(ckpt_dir):
         # Change current directory perms.
-        os.chmod(dirpath, mode)
+        os.chmod(dirpath, dir_mode)
         for filename in filenames:
             # Change file perms.
             file_path = Path(dirpath) / filename
-            os.chmod(file_path, mode)
+            os.chmod(file_path, file_mode)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- `recipes/vit/config/defaults.yaml` (1 hunks)
- `recipes/vit/config/vit_base_patch16_224.yaml` (1 hunks)
- `recipes/vit/train.py` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- recipes/vit/config/vit_base_patch16_224.yaml
- recipes/vit/config/defaults.yaml
🧰 Additional context used
🧬 Code graph analysis (1)
recipes/vit/train.py (3)
recipes/vit/imagenet_dataset.py (6)
- `ImageNetDataset` (250-337)
- `infinite_dataloader` (38-45)
- `filenames` (246-247)
- `filenames` (336-337)
- `filename` (243-244)
- `filename` (333-334)

recipes/vit/imagenet_utils.py (2)
- `transforms_imagenet_eval` (946-1033)
- `transforms_imagenet_train` (775-943)

recipes/vit/vit.py (1)
- `VisionTransformer` (1218-1657)
🔇 Additional comments (12)
recipes/vit/train.py (12)
1-38: LGTM! Clean imports and structure. The imports are well-organized and the file structure follows good practices with proper licensing and clear separation of concerns.
43-56: LGTM! Well-documented checkpoint loading function. The function properly handles distributed checkpoint loading with clear documentation and appropriate state dict management.
103-114: LGTM! Clean checkpoint saving implementation. The function properly handles distributed checkpoint saving with an appropriate state dict structure.
150-201: LGTM! Comprehensive device mesh setup. The device mesh setup is well-documented and properly handles the distributed training configuration with appropriate validation and sub-mesh creation.
204-252: LGTM! Clean model initialization with TransformerEngine support. The model initialization properly handles both standard and TransformerEngine configurations with appropriate conditional logic and meta device support.
254-296: LGTM! Comprehensive FSDP configuration. The Megatron-FSDP setup is well-documented with clear explanations of each parameter and appropriate device mesh integration.
309-364: LGTM! Proper dataset setup with distributed sampling. The dataset initialization properly handles both training and validation datasets with appropriate distributed sampling configuration and persistent workers for FSDP compatibility.
369-374: LGTM! Standard training utilities setup. The loss function and LR scheduler initialization are appropriate for the training configuration.
392-418: LGTM! Standard training step implementation.The training step properly handles data movement, forward/backward passes, gradient clipping, and optimizer updates with appropriate timing measurements.
419-450: LGTM! Proper validation loop with distributed reduction.The validation implementation correctly handles distributed loss reduction and checkpoint saving with appropriate model state management.
451-494: LGTM! Comprehensive logging and monitoring.The logging implementation provides detailed metrics for both stdout and WandB with appropriate timing measurements and memory monitoring.
495-514: LGTM! Proper cleanup and profiling output.The cleanup code properly handles memory profiler dumping, WandB finishing, and process group destruction.
Actionable comments posted: 17
♻️ Duplicate comments (4)
recipes/vit/train.py (3)
113-115: Checkpoint name parsing is brittle in load_auto_resume_checkpoint. This is implemented in recipes/vit/checkpoint.py; the calling site inherits the risk.
Ensure malformed dirs are skipped (parse+validate parts, handle ValueError/IndexError) as previously suggested in review. Would you like a patch rebased onto current checkpoint.py?
120-123: Add error handling when extracting the dataset tar. Apply:
```diff
-    if not Path(cfg.dataset.train.root).exists():
-        with tarfile.open(Path(cfg.dataset.train.root).parent.with_suffix(".tar.gz"), "r:gz") as tar:
-            tar.extractall(Path(cfg.dataset.train.root).parent.parent)
+    if not Path(cfg.dataset.train.root).exists():
+        tar_path = Path(cfg.dataset.train.root).parent.with_suffix(".tar.gz")
+        if not tar_path.exists():
+            raise FileNotFoundError(f"Dataset archive not found: {tar_path}")
+        try:
+            with tarfile.open(tar_path, "r:gz") as tar:
+                tar.extractall(Path(cfg.dataset.train.root).parent.parent)
+        except (tarfile.TarError, OSError) as e:
+            raise RuntimeError(f"Failed to extract dataset: {e}")
```
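A standalone sketch of the guarded extraction pattern (helper name and paths are illustrative, not from the recipe):

```python
import tarfile
import tempfile
from pathlib import Path

def extract_dataset(tar_path: Path, dest: Path) -> None:
    """Extract a .tar.gz archive with explicit, actionable errors."""
    if not tar_path.exists():
        raise FileNotFoundError(f"Dataset archive not found: {tar_path}")
    try:
        with tarfile.open(tar_path, "r:gz") as tar:
            tar.extractall(dest)
    except (tarfile.TarError, OSError) as e:
        raise RuntimeError(f"Failed to extract dataset: {e}") from e

# Demo with a throwaway archive.
work = Path(tempfile.mkdtemp())
(work / "train").mkdir()
(work / "train" / "img.txt").write_text("pixel data placeholder")
archive = work / "train.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(work / "train", arcname="train")

out = work / "extracted"
extract_dataset(archive, out)
assert (out / "train" / "img.txt").read_text() == "pixel data placeholder"
```

A missing archive now fails with `FileNotFoundError` up front instead of an opaque `tarfile.open` traceback deep in training startup.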
199-208: Fix resume semantics: enumerate(start=...) doesn’t skip data. This only offsets the counter; it does not advance the iterator. Use islice to drop items so training truly resumes.
Apply:
```diff
+    from itertools import islice
@@
-    for batch_idx, (input, target) in enumerate(
-        # Skip to latest step.
-        infinite_dataloader(train_dataloader, train_sampler),
-        start=latest_step_idx,
-    ):
+    train_iter = islice(infinite_dataloader(train_dataloader, train_sampler), latest_step_idx, None)
+    for batch_idx, (input, target) in zip(range(latest_step_idx, cfg.training.steps), train_iter):
```

And remove the manual break below. See next hunk.
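The difference is easy to demonstrate standalone (a toy integer stream stands in for the dataloader):

```python
from itertools import islice

# enumerate(start=3) only relabels the counter: the first item yielded is still 0.
first_with_enumerate = next(enumerate(iter(range(100)), start=3))

# islice actually drops the first 3 items before yielding anything.
first_with_islice = next(enumerate(islice(iter(range(100)), 3, None), start=3))

assert first_with_enumerate == (3, 0)  # counter offset, data NOT skipped
assert first_with_islice == (3, 3)     # counter and data aligned
```

With the `enumerate(start=...)` form, a resumed run replays batches that were already trained on; `islice` advances the underlying iterator so step N really sees batch N.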
recipes/vit/README.md (1)
22-31: Prefer venv; remove incorrect PIP_CONSTRAINT guidance. Apply:
````diff
-#### PIP Install
+### PIP Install
@@
-```
-cd recipes/vit
-# If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
-# We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
-pip install -r requirements.txt
-```
+```bash
+cd recipes/vit
+# Recommended: create and activate a clean virtual environment (or use the provided Docker image)
+python -m venv .venv && source .venv/bin/activate
+pip install -r requirements.txt
+```
````
🧹 Nitpick comments (21)
recipes/vit/checkpoint.py (4)
47-51: Typo and clarity in warning message. Typo in the quoted prefix and clearer guidance.
Apply:
```diff
-        _logger.warning(
-            f"Checkpoint state dictionary keys ({first_key}) may be prefixed "
-            "with 'modele.' if converted from a Megatron-FSDP DCP checkpoint."
-            "Set megatron_fsdp=True to automatically strip the prefix."
-        )
+        _logger.warning(
+            f"State dict keys (e.g., {first_key}) appear to be prefixed with 'module.' "
+            "from a Megatron-FSDP checkpoint. Set megatron_fsdp=True to strip it."
+        )
```
52-55: Consider safer torch.load usage. Default to `weights_only=True` when possible to reduce the pickle attack surface; fall back when the checkpoint carries optimizer or other non-tensor payloads.

Example:

```diff
-    # Must load with weights_only=False if you have an optimizer state in your checkpoint.
-    ckpt = torch.load(checkpoint_path, weights_only=False)
+    # Prefer weights_only=True; fall back if extra payloads are present.
+    try:
+        ckpt = torch.load(checkpoint_path, weights_only=True)
+    except (TypeError, pickle.UnpicklingError):
+        ckpt = torch.load(checkpoint_path, weights_only=False)
```

Note: `weights_only=True` rejects disallowed objects with `pickle.UnpicklingError` (and older torch without the kwarg raises `TypeError`), so catch both; this requires `import pickle`.
91-109: Auto-resume: directory parsing robustness. Assumes every subdir matches `step_<i>_loss_<val>`. Guard against malformed names and empty parts to avoid ValueError.

Apply:

```diff
-    subdirs = [x.absolute() for x in Path(cfg.training.checkpoint.path).iterdir() if x.is_dir()]
+    subdirs = [x.absolute() for x in Path(cfg.training.checkpoint.path).iterdir() if x.is_dir()]
+    # Keep only dirs matching the expected pattern.
+    def _parse(dirpath: Path):
+        parts = dirpath.name.split("_")
+        if len(parts) == 4 and parts[0] == "step" and parts[2] == "loss":
+            try:
+                return int(parts[1]), float(parts[3])
+            except ValueError:
+                return None
+        return None
+    parsed = [(d, _parse(d)) for d in subdirs]
+    subdirs = [d for d, p in parsed if p is not None]
     if len(subdirs) > 0:
         opt_metric_coeff = 1 if cfg.training.checkpoint.resume_from_metric == "+" else -1
         latest_subdir = max(
             subdirs,
             key=lambda x: (
                 opt_metric_coeff * float(x.name.split("_")[3])
                 if cfg.training.checkpoint.resume_from_metric
                 else 0,
                 x.stat().st_mtime,
             ),
         )
```
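The parsing helper can be exercised standalone against the recipe's `step_<i>_loss_<val>` naming convention (the helper below is a toy version for illustration):

```python
def parse_ckpt_dirname(name: str):
    """Return (step, loss) for names like 'step_75_loss_1.725', else None."""
    parts = name.split("_")
    if len(parts) == 4 and parts[0] == "step" and parts[2] == "loss":
        try:
            return int(parts[1]), float(parts[3])
        except ValueError:
            return None
    return None

assert parse_ckpt_dirname("step_75_loss_1.725") == (75, 1.725)
assert parse_ckpt_dirname("step_abc_loss_1.0") is None  # non-numeric step
assert parse_ckpt_dirname("latest") is None             # wrong shape entirely
assert parse_ckpt_dirname("step_1_acc_0.9") is None     # wrong metric tag
```

Stray directories dropped into the checkpoint path (e.g. `wandb`, `latest`, partially deleted runs) then simply fall out of the candidate list instead of crashing auto-resume.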
118-130: Use stateful objects for DCP save. You can pass `model`/`optimizer` directly; avoids intermediate dicts and preserves future extensibility.

Apply:

```diff
-    state_dict = {}
-    if model is not None:
-        state_dict["model"] = model.state_dict()
-    if optimizer is not None:
-        state_dict["optimizer"] = optimizer.state_dict()
-    torch.distributed.checkpoint.save(state_dict, checkpoint_id=checkpoint_path)
+    state = {}
+    if model is not None:
+        state["model"] = model
+    if optimizer is not None:
+        state["optimizer"] = optimizer
+    torch.distributed.checkpoint.save(state, checkpoint_id=checkpoint_path)
```

recipes/vit/infer.py (2)
31-33: Docstring mismatch. Doc says “Non-distributed inference” but code initializes a process group and device mesh. Update wording.
Apply:
- """ - Inference script for ViT. Non-distributed inference. - """ + """Inference script for ViT (single-rank supported, initializes a distributed context)."""
36-49: Optional: CPU fallback for broader usability. If you want tests/usage to run on CPU-only, gate CUDA calls or add a fallback.
Example:
device = "cuda" if torch.cuda.is_available() else "cpu" model = build_vit_model(cfg, device_mesh).to(device) input = torch.randn(1, 3, 224, 224, device=device)recipes/vit/distributed.py (1)
76-77: Destroy process group only if initialized. Minor safety.
Apply:
```diff
-    torch.distributed.destroy_process_group()
+    if torch.distributed.is_initialized():
+        torch.distributed.destroy_process_group()
```

recipes/vit/train.py (3)
308-310: Remove manual break; loop now bounded by zip(range(...)). Apply:
```diff
-        # Terminate if completed training steps.
-        if batch_idx >= cfg.training.steps:
-            break
+        # loop bounded by range(cfg.training.steps) via zip()
```
220-221: Use non_blocking GPU transfers (paired with pin_memory). Small speed win; consistent with pin_memory above. Apply:
Apply:
```diff
-        input = input.cuda()
-        target = target.cuda()
+        input = input.cuda(non_blocking=True)
+        target = target.cuda(non_blocking=True)
```
299-306: Update progress bar every step (not only on log intervals). Current update inside the log block undercounts progress. Apply:
Apply:
```diff
-            progress_bar.update(1)
+            # logging only
@@
-            # Reset timer.
+            # Reset timer and advance progress.
             t_start = time.perf_counter()
+            if torch.distributed.get_rank() == 0:
+                progress_bar.update(1)
```

recipes/vit/README.md (6)
3-3: Fix TIMM attribution text and avoid bare URL. Apply:
```diff
-_Adapted ViT model code from huggingface/pytorch-image-models (TImM) written by Ross Wightman (@rwightman / Copyright 2020), which you can check out here: https://github.com/huggingface/pytorch-image-models_
+_Adapted ViT model code from [huggingface/pytorch-image-models (timm)](https://github.com/huggingface/pytorch-image-models) by Ross Wightman (@rwightman, Copyright 2020)._
```
5-22: Normalize heading levels and add languages to fenced blocks. Addresses MD001/MD040.
Apply:
````diff
-### Pre-Requisites
+## Pre-Requisites
@@
-#### Docker Container
+### Docker Container
@@
-```
+```bash
 docker build -t <image_repo>:<image_tag> .
@@
 Utilize plenty of shared memory (--shm-size) to support loading large batches of image data!
-```
+```bash
 docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
````
33-41: Heading level + fenced language + DCP link style. Apply:
````diff
-### Training a Vision Transformer
+## Training a Vision Transformer
@@
-```
+```bash
 torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
@@
-which will train on a local tiny 5-class version of ImageNet (super-tiny-imagenet-5) and save auto-resumable Torch DCP checkpoints to the `training.checkpoint.path` directory.
+which will train on a local tiny 5-class version of ImageNet (super-tiny-imagenet-5) and save auto-resumable Torch DCP checkpoints to the `training.checkpoint.path` directory.
````

---

49-66: Add languages to fenced blocks

Apply:

````diff
-#### Checkpoint Conversion
+### Checkpoint Conversion
@@
-```
+```bash
 python -m torch.distributed.checkpoint.format_utils --help
 usage: format_utils.py [-h] {torch_to_dcp,dcp_to_torch} src dst
@@
-```
+```
````
67-76: Add languages to fenced blocks. Apply:
````diff
-```
+```bash
 python -m torch.distributed.checkpoint.format_utils dcp_to_torch step_75_loss_1.725 torch_ckpt_test.pt
@@
-```
+```python
 from torch.distributed.checkpoint.format_utils import dcp_to_torch_save, torch_save_to_dcp
@@
 torch_save_to_dcp(TORCH_SAVE_CHECKPOINT_PATH, f"{CHECKPOINT_DIR}_new")
````
108-108: Avoid bare URL; use a link. Apply:
```diff
-TODO(@cspades): For converting DCP directly to HuggingFace SafeTensors checkpoints, you can look into: https://pytorch.org/blog/huggingface-safetensors-support-in-pytorch-distributed-checkpointing/
+TODO(@cspades): For converting DCP directly to HuggingFace SafeTensors checkpoints, see: [PyTorch blog: HuggingFace safetensors support in DCP](https://pytorch.org/blog/huggingface-safetensors-support-in-pytorch-distributed-checkpointing/).
```

recipes/vit/test_train.py (2)
53-55: Redundant resume override.

`vit_resume_config.training.steps = 10` repeats the already-set value from the compose overrides. Remove to reduce noise.

```diff
-    vit_resume_config = deepcopy(vit_config)
-    vit_resume_config.training.steps = 10
+    vit_resume_config = deepcopy(vit_config)
```
58-61: Make the checkpoint assertion robust. Depending on the saver (best/latest/step), more than one directory can exist after the first run. Assert “at least one”.
```diff
-    assert sum(1 for item in (Path(tmp_path) / "ckpt").iterdir() if item.is_dir()) == 1, (
-        "Expected 1 checkpoint with 10 training steps and validation interval of 10."
-    )
+    assert sum(1 for item in (Path(tmp_path) / "ckpt").iterdir() if item.is_dir()) >= 1, (
+        "Expected at least 1 checkpoint with 10 training steps and validation interval of 10."
+    )
```
1466-1473: Avoid fragile access to TransformerEngine internals.

Attribute names in TE are not stable. Gate rescaling by attribute presence to prevent AttributeError when `init_variance_rescale=True`.

```diff
-        for layer_id, layer in enumerate(self.blocks):
-            if self.block_fn == TransformerLayer:
-                rescale(layer.self_attention.proj.weight.data, layer_id + 1)
-                rescale(layer.layernorm_mlp.fc2_weight.data, layer_id + 1)
+        for layer_id, layer in enumerate(self.blocks):
+            if self.block_fn == TransformerLayer:
+                if hasattr(layer, "self_attention") and hasattr(layer.self_attention, "proj") and hasattr(
+                    layer.self_attention.proj, "weight"
+                ):
+                    rescale(layer.self_attention.proj.weight.data, layer_id + 1)
+                if hasattr(layer, "layernorm_mlp") and hasattr(layer.layernorm_mlp, "fc2_weight"):
+                    rescale(layer.layernorm_mlp.fc2_weight.data, layer_id + 1)
             else:
                 rescale(layer.attn.proj.weight.data, layer_id + 1)
                 rescale(layer.mlp.fc2.weight.data, layer_id + 1)
```
700-703: Respect tuple img_size. Current code overwrites tuples; support both int and (H, W).
```diff
-        img_size = (img_size, img_size)
+        if isinstance(img_size, int):
+            img_size = (img_size, img_size)
+        else:
+            assert len(img_size) == 2, "img_size tuple must be (H, W)"
```
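A standalone sketch of the normalization (the helper name is illustrative; in the model it would be inlined where `img_size` is consumed):

```python
def normalize_img_size(img_size):
    """Accept an int (square) or an (H, W) pair; always return a 2-tuple."""
    if isinstance(img_size, int):
        return (img_size, img_size)
    size = tuple(img_size)
    if len(size) != 2:
        raise ValueError(f"img_size must be int or (H, W), got {img_size!r}")
    return size

assert normalize_img_size(224) == (224, 224)
assert normalize_img_size((224, 196)) == (224, 196)
assert normalize_img_size([320, 240]) == (320, 240)
```

This keeps non-square inputs (common for fine-tuning at different aspect ratios) usable instead of silently squaring them.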
1062-1064: Raise explicit error on unknown pool type.

`assert not pool_type` can be optimized out; raise instead.

```diff
-    else:
-        assert not pool_type, f"Unknown pool type {pool_type}"
+    else:
+        raise AssertionError(f"Unknown pool type {pool_type}")
```
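For context, `assert` statements are stripped when Python runs with `-O`, so an unknown pool type would silently fall through; an explicit raise always fires. A toy dispatch illustrating the pattern (function and values are made up):

```python
def pool_features(pool_type: str, values):
    """Dispatch pooling; raise on unknown pool types instead of asserting."""
    if pool_type == "avg":
        return sum(values) / len(values)
    if pool_type == "max":
        return max(values)
    # An `assert not pool_type` here would vanish under `python -O`;
    # an explicit raise survives optimization flags.
    raise AssertionError(f"Unknown pool type {pool_type}")

assert pool_features("avg", [1.0, 3.0]) == 2.0
assert pool_features("max", [1.0, 3.0]) == 3.0
```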
📜 Review details
Configuration used: CodeRabbit UI
📒 Files selected for processing (12)
- recipes/vit/README.md (1 hunks)
- recipes/vit/checkpoint.py (1 hunks)
- recipes/vit/config/defaults.yaml (1 hunks)
- recipes/vit/config/vit_base_patch16_224.yaml (1 hunks)
- recipes/vit/config/vit_te_base_patch16_224.yaml (1 hunks)
- recipes/vit/distributed.py (1 hunks)
- recipes/vit/imagenet_dataset.py (1 hunks)
- recipes/vit/infer.py (1 hunks)
- recipes/vit/test_infer.py (1 hunks)
- recipes/vit/test_train.py (1 hunks)
- recipes/vit/train.py (1 hunks)
- recipes/vit/vit.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- recipes/vit/config/defaults.yaml
- recipes/vit/config/vit_te_base_patch16_224.yaml
- recipes/vit/imagenet_dataset.py
- recipes/vit/config/vit_base_patch16_224.yaml
🧰 Additional context used
🧬 Code graph analysis (4)
recipes/vit/infer.py (3)
recipes/vit/checkpoint.py (1)
- load_torch_checkpoint (27-54)

recipes/vit/distributed.py (1)

- initialize_distributed (23-77)

recipes/vit/vit.py (1)

- build_vit_model (70-98)
recipes/vit/test_infer.py (4)
recipes/vit/checkpoint.py (1)
- save_dcp_checkpoint (118-129)

recipes/vit/distributed.py (1)

- initialize_distributed (23-77)

recipes/vit/infer.py (1)

- main (30-49)

recipes/vit/vit.py (1)

- build_vit_model (70-98)
recipes/vit/train.py (5)
recipes/vit/checkpoint.py (2)
- load_auto_resume_checkpoint (73-115)
- save_auto_resumable_checkpoint (132-163)

recipes/vit/distributed.py (1)

- initialize_distributed (23-77)

recipes/vit/imagenet_dataset.py (2)

- ImageNetDataset (250-337)
- infinite_dataloader (38-45)

recipes/vit/imagenet_utils.py (2)

- transforms_imagenet_eval (946-1033)
- transforms_imagenet_train (775-943)

recipes/vit/vit.py (1)

- build_vit_model (70-98)
recipes/vit/test_train.py (1)
recipes/vit/train.py (1)
main(44-325)
🪛 LanguageTool
recipes/vit/README.md
[grammar] ~108-~108: There might be a mistake here.
Context: ...rt-in-pytorch-distributed-checkpointing/
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
53-53: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
69-69: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
75-75: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
108-108: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (4)
recipes/vit/test_infer.py (1)
51-59: DCP→Torch conversion OK. The staged DCP save and conversion to Torch format looks correct and aligns with the loader’s expectations.
recipes/vit/distributed.py (1)
63-71: Persist flattened sub-meshes to avoid GC
- Capture each flattened sub-mesh instead of discarding it, e.g.:

```python
dp_mesh = device_mesh[("dp_inter", "dp_shard")]._flatten("dp")
dp_cp_shard_mesh = device_mesh[("dp_shard", "cp")]._flatten("dp_cp_shard")
hsdp_mesh = device_mesh[("dp_inter", "dp_shard", "cp")]._flatten("hsdp")
device_mesh._submeshes = {"dp": dp_mesh, "dp_cp_shard": dp_cp_shard_mesh, "hsdp": hsdp_mesh}
```
- Check for a supported public API in place of the private `_flatten` and update to use it if available.

recipes/vit/train.py (1)
131-136: DeviceMesh “dp” sub-mesh is available
`initialize_distributed` flattens `("dp_inter", "dp_shard")` into `"dp"` before yielding the mesh, so `device_mesh["dp"].size()` and `device_mesh["dp"].get_local_rank()` are guaranteed and will not raise a KeyError.

recipes/vit/vit.py (1)
81-99: Ignore meta-device init suggestion: `with torch.device("meta")` is a supported context manager (including for "meta") and will reliably allocate all submodule parameters on the meta device; no change needed.

Likely an incorrect or invalid review comment.
Actionable comments posted: 1
♻️ Duplicate comments (11)
recipes/vit/test_infer.py (3)
28-29: Skip on CPU-only runners to avoid hard CUDA dependency in CI.

```diff
-@pytest.mark.parametrize("config_name", ["vit_base_patch16_224", "vit_te_base_patch16_224"])
+@pytest.mark.parametrize("config_name", ["vit_base_patch16_224", "vit_te_base_patch16_224"])
+@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA")
```
64-65: Prefer an undecorated run_inference(cfg) that returns output; assert shape. Current test only checks no-exception. Refactor infer.py to expose run_inference(cfg)->Tensor and use it here.
```diff
-# Run inference.
-main(vit_config)
+# Run inference.
+from .infer import run_inference  # move import to header
+out = run_inference(vit_config)
+assert out.shape[0] == 1
```
22-25: Fix imports to be package-relative; current form breaks under pytest from repo root. Make recipes/vit a package and use relative imports.
```diff
-from checkpoint import save_dcp_checkpoint
-from distributed import initialize_distributed
-from infer import main
-from vit import build_vit_model
+from .checkpoint import save_dcp_checkpoint
+from .distributed import initialize_distributed
+from .infer import main  # or run_inference (see below)
+from .vit import build_vit_model
```

Also add:
```python
# recipes/vit/__init__.py
"""ViT recipe package."""
```

recipes/vit/checkpoint.py (5)
2-2: Use canonical SPDX identifier "Apache-2.0".

```diff
-# SPDX-License-Identifier: LicenseRef-Apache2
+# SPDX-License-Identifier: Apache-2.0
```
37-45: Don’t use str.strip for prefix removal; it corrupts keys. strip() removes any of the characters, not the exact prefix.
```diff
-    checkpoint = torch.load(checkpoint_path, weights_only=False)
-    # Remove the "module." prefix from the keys of checkpoints
-    # derived from Megatron-FSDP.
-    model_checkpoint = {(k.strip("module.") if megatron_fsdp else k): v for k, v in checkpoint["model"].items()}
-    # Warn about Megatron-FSDP checkpoints.
-    first_key = next(iter(model_checkpoint))
-    if first_key.startswith("module.") and not megatron_fsdp:
+    checkpoint = torch.load(checkpoint_path, weights_only=False)
+    def _rm_prefix(name: str, prefix: str = "module.") -> str:
+        return name[len(prefix):] if name.startswith(prefix) else name
+    # Handle Megatron-FSDP "module." prefix correctly.
+    raw_keys = list(checkpoint["model"].keys())
+    model_checkpoint = {(_rm_prefix(k) if megatron_fsdp else k): v for k, v in checkpoint["model"].items()}
+    # Warn if keys appear prefixed but user forgot megatron_fsdp=True.
+    if (not megatron_fsdp) and any(k.startswith("module.") for k in raw_keys):
         _logger.warning(
-            f"Checkpoint state dictionary keys ({first_key}) may be prefixed "
+            f"Checkpoint state dictionary keys ({raw_keys[0]}) may be prefixed "
             "with 'module.' if converted from a Megatron-FSDP DCP checkpoint."
             "Set megatron_fsdp=True to automatically strip the prefix."
         )
```

(Note the warning now references `raw_keys[0]`, since `first_key` is removed in this version.)
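To see why `strip` corrupts keys, here is a quick standalone check (plain Python, no torch needed; the key name is a typical example):

```python
key = "module.mlp.fc2.weight"

# str.strip removes ANY leading/trailing characters drawn from the set
# {'m','o','d','u','l','e','.'}, so it eats into the real key name:
assert key.strip("module.") == "p.fc2.weight"  # corrupted!

# Exact-prefix removal keeps the key intact (str.removeprefix needs Python 3.9+):
assert key.removeprefix("module.") == "mlp.fc2.weight"
assert "head.weight".removeprefix("module.") == "head.weight"  # no-op when absent
```

After stripping "module.", `strip` keeps consuming the leading "ml" of "mlp" because those characters are also in the strip set, which is exactly the silent key-mangling the suggested `_rm_prefix` helper avoids.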
56-70: Robust DCP load: support None objects and load stateful objects directly. Current code will crash when optimizer is None and unnecessarily materializes state_dicts.
```diff
-def load_dcp_checkpoint(checkpoint_path, model=None, optimizer=None):
+def load_dcp_checkpoint(checkpoint_path, model=None, optimizer=None):
     """Load a Torch DCP checkpoint from checkpoint_path into model and optimizer.

     Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
     """
-    # Load model and optimizer checkpoints.
-    state_dict = {}
-    if model is not None:
-        state_dict["model"] = model.state_dict()
-    if optimizer is not None:
-        state_dict["optimizer"] = optimizer.state_dict()
-    torch.distributed.checkpoint.load(state_dict, checkpoint_id=checkpoint_path)
-    model.load_state_dict(state_dict["model"])
-    optimizer.load_state_dict(state_dict["optimizer"])
+    state = {}
+    if model is not None:
+        state["model"] = model
+    if optimizer is not None:
+        state["optimizer"] = optimizer
+    torch.distributed.checkpoint.load(state, checkpoint_id=checkpoint_path)
```
110-112: Guard rank-0 logging; get_rank() raises if PG isn’t initialized.

```diff
-    if torch.distributed.get_rank() == 0:
+    if torch.distributed.is_available() and torch.distributed.is_initialized() and torch.distributed.get_rank() == 0:
         _logger.info(f"Loaded latest model and optimizer checkpoints from: {latest_subdir}")
```
153-162: Avoid world-writable 0o777 on checkpoints; use 0o775 and guard logging. 0o777 is risky on shared filesystems.
```diff
-    # Relax checkpoint permissions, which may be helpful when saving checkpoints in a container owned by root.
-    mode = 0o777
+    # Relax checkpoint permissions for collaboration; avoid world-writable.
+    mode = 0o775
     for dirpath, _, filenames in os.walk(ckpt_dir):
         # Change current directory perms.
         os.chmod(dirpath, mode)
         for filename in filenames:
             # Change file perms.
             file_path = Path(dirpath) / filename
             os.chmod(file_path, mode)
-    if torch.distributed.get_rank() == 0:
+    if torch.distributed.is_available() and torch.distributed.is_initialized() and torch.distributed.get_rank() == 0:
         _logger.info(f"Saved validated checkpoint to: {ckpt_dir}")
```

recipes/vit/config/vit_base_patch16_224.yaml (3)
76-76: num_classes must match dataset (5); 100000 will break head shape.

```diff
-  num_classes: 100000
+  num_classes: 5
```
46-56: Align FSDP strategy types with defaults (strings, not ints). Hydra/OmegaConf type mismatches will cause parsing/branching issues.
```diff
   init_model_with_meta_device: true
-  zero_dp_strategy: 3
+  zero_dp_strategy: "optim_grads_params"
   fsdp_unit_modules:
     - vit.Block
     - vit.PatchEmbed
     - torch.nn.LayerNorm
     - torch.nn.Linear
   use_hybrid_fsdp: true
-  outer_dp_sharding_strategy: 1
+  outer_dp_sharding_strategy: "optim"
   grad_reduce_in_fp32: false
   preserve_fp32_weights: true
```

If defaults.yaml uses different enum names, update to those exact strings.
80-90: Point dataset paths to bundled data or parameterize a data_root. As-is, example runs will fail to find files.
Option A (absolute within repo):
```diff
-    root: "./data/super-tiny-imagenet-5/train"
-    class_map: "./data/super-tiny-imagenet-5/words.txt"
+    root: "recipes/vit/data/super-tiny-imagenet-5/train"
+    class_map: "recipes/vit/data/super-tiny-imagenet-5/words.txt"
 ...
-    root: "./data/super-tiny-imagenet-5/val"
-    class_map: "./data/super-tiny-imagenet-5/words.txt"
-    label_map: "./data/super-tiny-imagenet-5/val/val_annotations.txt"
+    root: "recipes/vit/data/super-tiny-imagenet-5/val"
+    class_map: "recipes/vit/data/super-tiny-imagenet-5/words.txt"
+    label_map: "recipes/vit/data/super-tiny-imagenet-5/val/val_annotations.txt"
```

Option B (preferred: configurable root):

```diff
+dataset:
+  data_root: "recipes/vit/data/super-tiny-imagenet-5"
   train:
-    root: "./data/super-tiny-imagenet-5/train"
-    class_map: "./data/super-tiny-imagenet-5/words.txt"
+    root: "${dataset.data_root}/train"
+    class_map: "${dataset.data_root}/words.txt"
   val:
-    root: "./data/super-tiny-imagenet-5/val"
-    class_map: "./data/super-tiny-imagenet-5/words.txt"
-    label_map: "./data/super-tiny-imagenet-5/val/val_annotations.txt"
+    root: "${dataset.data_root}/val"
+    class_map: "${dataset.data_root}/words.txt"
+    label_map: "${dataset.data_root}/val/val_annotations.txt"
```
🧹 Nitpick comments (4)
recipes/vit/test_infer.py (2)
18-21: Import torch if you gate the test on CUDA. Required for skip marker below.
```diff
 import pytest
 from hydra import compose, initialize_config_dir
 from torch.distributed.checkpoint.format_utils import dcp_to_torch_save
+import torch
```
62-63: Assert checkpoint conversion succeeded for a minimal correctness signal.dcp_to_torch_save(Path(tmp_path) / "test_infer_dcp_checkpoint", test_ckpt_path) +assert test_ckpt_path.exists()recipes/vit/checkpoint.py (1)
117-129: Optional: Save stateful objects directly with DCP. Keeps save/load symmetric and lets sharded state handle itself.
```diff
-    state_dict = {}
-    if model is not None:
-        state_dict["model"] = model.state_dict()
-    if optimizer is not None:
-        state_dict["optimizer"] = optimizer.state_dict()
-    torch.distributed.checkpoint.save(state_dict, checkpoint_id=checkpoint_path)
+    state = {}
+    if model is not None:
+        state["model"] = model
+    if optimizer is not None:
+        state["optimizer"] = optimizer
+    torch.distributed.checkpoint.save(state, checkpoint_id=checkpoint_path)
```

recipes/vit/config/vit_base_patch16_224.yaml (1)
67-75: Default inference flag megatron_fsdp: consider false for quickstart. Most new users will run non-Megatron checkpoints first; setting true by default can confuse.
format: "torch" - megatron_fsdp: true + megatron_fsdp: false
📜 Review details
Configuration used: CodeRabbit UI
📒 Files selected for processing (6)
- recipes/vit/checkpoint.py (1 hunks)
- recipes/vit/config/defaults.yaml (1 hunks)
- recipes/vit/config/vit_base_patch16_224.yaml (1 hunks)
- recipes/vit/config/vit_te_base_patch16_224.yaml (1 hunks)
- recipes/vit/infer.py (1 hunks)
- recipes/vit/test_infer.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- recipes/vit/infer.py
- recipes/vit/config/vit_te_base_patch16_224.yaml
- recipes/vit/config/defaults.yaml
🧰 Additional context used
🧬 Code graph analysis (1)
recipes/vit/test_infer.py (4)
recipes/vit/checkpoint.py (1)
- save_dcp_checkpoint (117-128)

recipes/vit/distributed.py (1)

- initialize_distributed (23-77)

recipes/vit/infer.py (1)

- main (30-51)

recipes/vit/vit.py (1)

- build_vit_model (70-98)
build_vit_model(70-98)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
Force-pushed from 24e0a65 to 0fffd7b
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1103      +/-   ##
==========================================
- Coverage   79.57%   79.57%   -0.01%
==========================================
  Files         160      160
  Lines       11826    11826
==========================================
- Hits         9411     9410       -1
- Misses       2415     2416       +1
```
Force-pushed from 43430ee to c09c4bb
Actionable comments posted: 5
♻️ Duplicate comments (8)
recipes/vit/checkpoint.py (2)
57-73: Simplify DCP checkpoint loading by passing stateful objects directly. The current implementation unnecessarily creates intermediate state dicts. DCP supports loading directly into stateful objects.
Apply this diff:
```diff
 def load_dcp_checkpoint(checkpoint_path, model=None, optimizer=None):
     """Load a Torch DCP checkpoint from checkpoint_path into model and optimizer.

     Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
     """
-    # Load model and optimizer checkpoints.
-    state_dict = {}
-    if model is not None:
-        state_dict["model"] = model.state_dict()
-    if optimizer is not None:
-        state_dict["optimizer"] = optimizer.state_dict()
-    torch.distributed.checkpoint.load(state_dict, checkpoint_id=checkpoint_path)
-    if model is not None:
-        model.load_state_dict(state_dict["model"])
-    if optimizer is not None:
-        optimizer.load_state_dict(state_dict["optimizer"])
+    state = {}
+    if model is not None:
+        state["model"] = model
+    if optimizer is not None:
+        state["optimizer"] = optimizer
+    if state:  # Only load if there's something to load
+        torch.distributed.checkpoint.load(state, checkpoint_id=checkpoint_path)
```
156-163: Overly permissive 0777 permissions on checkpoints. World-writable permissions (0777) are a security risk on shared filesystems. Consider using more restrictive permissions.
Apply this diff:
```diff
-    # Relax checkpoint permissions, which may be helpful when saving checkpoints in a container owned by root.
-    mode = 0o777
+    # Relax checkpoint permissions for collaboration; avoid world-writable.
+    mode = 0o775
     for dirpath, _, filenames in os.walk(ckpt_dir):
         # Change current directory perms.
         os.chmod(dirpath, mode)
         for filename in filenames:
             # Change file perms.
             file_path = Path(dirpath) / filename
             os.chmod(file_path, mode)
```

recipes/vit/distributed.py (1)
45-51: Initialize with explicit backend and fix LOCAL_RANK fallback. The current implementation has several issues:
- Backend should be explicit (NCCL for CUDA)
- Fallback to global rank for LOCAL_RANK is incorrect on multi-node setups
- Missing CUDA availability check and device bounds validation
Apply this diff:
```diff
-    torch.distributed.init_process_group()
+    if not torch.cuda.is_available():
+        raise RuntimeError("CUDA is required for this recipe; no GPU visible.")
+    torch.distributed.init_process_group(backend="nccl", init_method="env://")
-    local_rank = int(os.getenv("LOCAL_RANK", torch.distributed.get_rank()))
-    torch.cuda.set_device(local_rank)
+    local_rank = int(os.getenv("LOCAL_RANK", -1))
+    if local_rank < 0:
+        world_rank = torch.distributed.get_rank()
+        local_rank = world_rank % torch.cuda.device_count()
+    assert 0 <= local_rank < torch.cuda.device_count(), (
+        f"Computed local_rank {local_rank} out of range for {torch.cuda.device_count()} visible devices."
+    )
+    torch.cuda.set_device(local_rank)
```

recipes/vit/vit.py (2)
1556-1556: Positional embedding calls need null guards.

Previous reviews correctly identified that `self.pos_embed(x)` can fail when `pos_embed=False`, since `self.pos_embed` would be `None`.
482-486: Buffer registration issue with qkv_bias parameter.The code registers
Noneas a buffer whenqkv_bias=True, which is incorrect. Buffers should be tensor objects.recipes/vit/README.md (2)
26-31: Remove incorrect PIP_CONSTRAINT advice; show venv-based install.

````diff
-```
-cd recipes/vit
-# If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
-# We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
-pip install -r requirements.txt
-```
+```bash
+cd recipes/vit
+# Recommended: create and activate a clean virtual environment (or use the provided Docker image)
+python -m venv .venv && source .venv/bin/activate
+pip install -r requirements.txt
+```
````
41-41: Fix DCP docs host to pytorch.org (not docs.pytorch.org).

```diff
-... save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) ...
+... save auto-resumable [PyTorch Distributed Checkpoint (DCP)](https://pytorch.org/docs/stable/distributed.checkpoint.html) ...
```

```diff
-  Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
+  Docs: https://pytorch.org/docs/stable/distributed.checkpoint.html
```

Also applies to: 62-63, 76-77
recipes/vit/train.py (1)
183-187: Resume data iteration correctly: `enumerate(start=...)` doesn't skip items.

Current code reprocesses batches after resume. Use `itertools.islice` to drop the first N items.
```diff
-        for batch_idx, sample in enumerate(
-            # Skip to latest step.
-            infinite_dataloader(train_dataloader, train_sampler),
-            start=latest_step_idx,
-        ):
+        import itertools
+        # Skip already-processed steps to avoid duplication on resume.
+        train_iter = itertools.islice(
+            infinite_dataloader(train_dataloader, train_sampler),
+            latest_step_idx,
+            None,
+        )
+        for batch_idx, sample in enumerate(train_iter, start=latest_step_idx):
```
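To see why `islice` is needed, this small stdlib-only example contrasts the two approaches on a plain `range` standing in for the dataloader:

```python
import itertools

# enumerate(..., start=3) only relabels indices; items 0-2 are re-yielded.
relabeled = list(itertools.islice(enumerate(iter(range(10)), start=3), 3))
print(relabeled)  # -> [(3, 0), (4, 1), (5, 2)]

# islice actually discards the first 3 items, so iteration resumes at item 3.
resumed = list(
    itertools.islice(enumerate(itertools.islice(iter(range(10)), 3, None), start=3), 3)
)
print(resumed)  # -> [(3, 3), (4, 4), (5, 5)]
```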
🧹 Nitpick comments (11)
recipes/vit/distributed.py (2)
56-59: Fix validation error message to exclude tp from product

The error message incorrectly includes `tp` in the product calculation, but the validation only checks `dp_outer * dp_shard * cp`.

Apply this diff:

```diff
-    f"Invalid parallelism sizes: dp_outer({dp_outer}) * dp_shard({dp_shard}) * cp({cp}) * tp({tp}) != world_size({torch.distributed.get_world_size()})"
+    f"Invalid parallelism sizes: dp_outer({dp_outer}) * dp_shard({dp_shard}) * cp({cp}) != world_size({torch.distributed.get_world_size()}) [tp={tp} is separate]"
```
74-79: Consider adding type hints for better code clarity

Adding type hints would improve code maintainability and IDE support. Consider adding type hints to the function signature:

```diff
+from typing import Generator
+from torch.distributed.device_mesh import DeviceMesh
+
 @contextmanager
 def initialize_distributed(
     dp_outer: int = 1,
     dp_shard: int = 1,
     cp: int = 1,
     tp: int = 1,
-):
+) -> Generator[DeviceMesh, None, None]:
```

recipes/vit/checkpoint.py (1)
99-110: Potential IndexError when parsing checkpoint directory names

The code assumes checkpoint directories follow the exact format `step_<idx>_loss_<value>` but doesn't validate this, which could cause IndexErrors if directories don't match the expected format.

Add validation for directory name format:

```diff
         # We expect a checkpoint named as: step_<step_idx>_loss_<loss_value>.
         # Get the latest step, the directory with the most recent modification time.
         opt_metric_coeff = 1 if cfg.training.checkpoint.resume_from_metric == "+" else -1
+
+        def parse_checkpoint_dir(path):
+            """Parse checkpoint directory name, return None if invalid format."""
+            try:
+                parts = path.name.split("_")
+                if len(parts) >= 4 and parts[0] == "step" and parts[2] == "loss":
+                    return int(parts[1]), float(parts[3])
+            except (ValueError, IndexError):
+                pass
+            return None
+
+        valid_subdirs = [(d, parse_checkpoint_dir(d)) for d in subdirs]
+        valid_subdirs = [(d, parsed) for d, parsed in valid_subdirs if parsed is not None]
+
+        if not valid_subdirs:
+            _logger.warning("No valid checkpoint directories found")
+            return 0
+
         latest_subdir = max(
-            subdirs,
+            valid_subdirs,
             key=lambda x: (
-                opt_metric_coeff * float(x.name.split("_")[3])
+                opt_metric_coeff * x[1][1]
                 if cfg.training.checkpoint.resume_from_metric
                 else 0,
-                x.stat().st_mtime,
+                x[0].stat().st_mtime,
             ),
-        )
+        )[0]
```

recipes/vit/config/defaults.yaml (1)
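A standalone sketch of the defensive parsing idea, using a hypothetical `parse_checkpoint_name` helper that works on plain strings rather than `Path` objects:

```python
def parse_checkpoint_name(name):
    """Return (step_idx, loss) for 'step_<idx>_loss_<value>' names, else None."""
    parts = name.split("_")
    try:
        if len(parts) >= 4 and parts[0] == "step" and parts[2] == "loss":
            return int(parts[1]), float(parts[3])
    except (ValueError, IndexError):
        # Malformed step index or loss value: treat as not-a-checkpoint.
        pass
    return None


print(parse_checkpoint_name("step_75_loss_1.725"))  # -> (75, 1.725)
print(parse_checkpoint_name("latest"))              # -> None
print(parse_checkpoint_name("step_x_loss_0.1"))     # -> None
```

Stray directories (editor backups, partially written checkpoints) then simply parse to `None` instead of crashing the resume logic.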
50-50: Consider documenting FSDP zero_dp_strategy options

The `zero_dp_strategy: "optim_grads_params"` setting affects memory usage significantly. Consider adding a comment explaining the available options. Add documentation for clarity:

```diff
 init_model_with_meta_device: true
-zero_dp_strategy: "optim_grads_params"
+zero_dp_strategy: "optim_grads_params"  # Options: "none", "optim", "grads", "optim_grads", "params", "optim_params", "grads_params", "optim_grads_params"
```

recipes/vit/README.md (2)
5-5: Heading level jump (MD001).

Promote "Pre-Requisites" to H2 so levels increment by one.

```diff
-### Pre-Requisites
+## Prerequisites
```
11-13: Add languages to fenced code blocks (MD040).

Specify bash/python for better rendering and copy/paste UX.

````diff
-```
+```bash
 docker build -t <image_repo>:<image_tag> .
 ```

-```
+```bash
 # Utilize plenty of shared memory (--shm-size) to support loading large batches of image data!
 docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
 ```

-```
+```bash
 torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
 ```

-```
+```text
 python -m torch.distributed.checkpoint.format_utils --help
 usage: format_utils.py [-h] {torch_to_dcp,dcp_to_torch} src dst
 ```
@@
-```
+```bash
 python -m torch.distributed.checkpoint.format_utils dcp_to_torch step_75_loss_1.725 torch_ckpt_test.pt
 ```

 ```python
 from torch.distributed.checkpoint.format_utils import (
     dcp_to_torch_save,
     torch_save_to_dcp,
 )
 ```
@@
 ```python
 from checkpoint import load_torch_checkpoint
 ```
````

Also applies to: 17-20, 37-39, 105-117, 121-138, 145-157

recipes/vit/train.py (3)

127-136: Enable pinned memory for faster H2D; guard persistent_workers (already done).

Pin host memory and use non_blocking copies below.

```diff
 train_dataloader = DataLoader(
     beans_train_dataset,
     batch_size=cfg.dataset.train.batch_size,
     sampler=train_sampler,
     num_workers=cfg.dataset.num_workers,
@@
-    persistent_workers=(cfg.dataset.num_workers > 0),
+    persistent_workers=(cfg.dataset.num_workers > 0),
+    pin_memory=True,
 )
@@
 val_dataloader = DataLoader(
     beans_val_dataset,
     batch_size=cfg.dataset.val.batch_size,
     sampler=val_sampler,
     num_workers=cfg.dataset.num_workers,
@@
-    persistent_workers=(cfg.dataset.num_workers > 0),
+    persistent_workers=(cfg.dataset.num_workers > 0),
+    pin_memory=True,
 )
```

Also update the H2D transfers:
```diff
-input = input.cuda()
-target = target.cuda()
+input = input.cuda(non_blocking=True)
+target = target.cuda(non_blocking=True)
```

Also applies to: 149-158
193-201: Zero gradients with `set_to_none=True`.

Better perf and avoids residual grads.

```diff
-optimizer.zero_grad()
+optimizer.zero_grad(set_to_none=True)
```
254-255: Optional: synchronize before timing for stable throughput.

Ensures CUDA kernels complete before measuring.

```diff
-t_end = time.perf_counter()
+torch.cuda.synchronize()
+t_end = time.perf_counter()
```

recipes/vit/beans.py (1)
27-35: `infinite_dataloader` is fine; consider exposing a start offset.

If other callers need resume-on-step, optionally accept a `start` parameter and skip via `itertools.islice`.

```diff
-def infinite_dataloader(dataloader, sampler):
+def infinite_dataloader(dataloader, sampler, *, start: int = 0):
@@
     while True:
         sampler.set_epoch(epoch)  # Update epoch for proper shuffling
-        for batch in dataloader:
+        for i, batch in enumerate(dataloader):
+            if epoch == 0 and i < start:
+                continue
             yield batch
```

recipes/vit/config/vit_base_patch16_224.yaml (1)
68-76: Inference defaults: consider `megatron_fsdp: false` when `format: "torch"`.

Safer out of the box, since non-distributed torch checkpoints don't require Megatron-FSDP.

```diff
 checkpoint:
   path: null
   # Load a DCP->Torch converted checkpoint for inference without Megatron-FSDP.
   # Otherwise, set this to "torch_dcp" if using Megatron-FSDP for inference.
   # If the checkpoint was not trained with Megatron-FSDP, then set megatron_fsdp to false.
   format: "torch"
-  megatron_fsdp: true
+  megatron_fsdp: false
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- recipes/vit/.gitignore (1 hunks)
- recipes/vit/.ruff.toml (1 hunks)
- recipes/vit/Dockerfile (1 hunks)
- recipes/vit/LICENSE (1 hunks)
- recipes/vit/README.md (1 hunks)
- recipes/vit/beans.py (1 hunks)
- recipes/vit/checkpoint.py (1 hunks)
- recipes/vit/config/defaults.yaml (1 hunks)
- recipes/vit/config/vit_base_patch16_224.yaml (1 hunks)
- recipes/vit/config/vit_te_base_patch16_224.yaml (1 hunks)
- recipes/vit/distributed.py (1 hunks)
- recipes/vit/infer.py (1 hunks)
- recipes/vit/requirements.txt (1 hunks)
- recipes/vit/test_infer.py (1 hunks)
- recipes/vit/test_train.py (1 hunks)
- recipes/vit/train.py (1 hunks)
- recipes/vit/vit.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- recipes/vit/.gitignore
🚧 Files skipped from review as they are similar to previous changes (6)
- recipes/vit/.ruff.toml
- recipes/vit/Dockerfile
- recipes/vit/infer.py
- recipes/vit/test_infer.py
- recipes/vit/config/vit_te_base_patch16_224.yaml
- recipes/vit/test_train.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-09T23:57:34.120Z
Learnt from: pstjohn
PR: NVIDIA/bionemo-framework#1139
File: models/geneformer/src/geneformer/convert.py:1-3
Timestamp: 2025-09-09T23:57:34.120Z
Learning: NVIDIA's legal team has specified using "LicenseRef-Apache2" as the SPDX license identifier instead of the standard "Apache-2.0" identifier in their codebase header files.
Applied to files:
recipes/vit/checkpoint.py
📚 Learning: 2025-08-28T16:40:04.315Z
Learnt from: pstjohn
PR: NVIDIA/bionemo-framework#1078
File: recipes/esm2_native_te_mfsdp/train_ddp.py:103-108
Timestamp: 2025-08-28T16:40:04.315Z
Learning: PyTorch DistributedDataParallel constructor accepts a device_mesh parameter in recent versions, which supports advanced distributed training scenarios and nvFSDP configurations.
Applied to files:
recipes/vit/distributed.py
🧬 Code graph analysis (1)
recipes/vit/train.py (4)
recipes/vit/beans.py (2)
- BeansDataset (37-65)
- infinite_dataloader (27-34)

recipes/vit/checkpoint.py (2)
- load_auto_resume_checkpoint (75-117)
- save_auto_resumable_checkpoint (134-165)

recipes/vit/distributed.py (1)
- initialize_distributed (23-85)

recipes/vit/vit.py (1)
- build_vit_model (77-105)
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
105-105: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
121-121: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
159-159: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests (recipes/vit, svcbionemo023/bionemo-framework:pytorch25.06-py3-squashed-zstd)
🔇 Additional comments (12)
recipes/vit/LICENSE (1)
1-224: LGTM! Dual licensing properly documented

The dual licensing (Apache 2.0 and MIT) is clearly presented with appropriate copyright attributions for both Ross Wightman (2019) and AIR Lab Makerere University (2020).
recipes/vit/checkpoint.py (3)
2-2: Using non-standard SPDX license identifier

Based on the retrieved learning, NVIDIA's legal team has specified using "LicenseRef-Apache2" as the SPDX license identifier in their codebase, so this is correct for this repository.
43-43: LGTM! Using removeprefix correctly

The code correctly uses `removeprefix("module.")`, which only removes the exact prefix if present, avoiding the issues that would arise from using `strip()`.
113-114: LGTM! Proper rank-0 logging guards

The code correctly checks if the distributed process group is initialized before calling `get_rank()`, preventing potential runtime errors.

Also applies to: 164-165
recipes/vit/config/defaults.yaml (2)
1-92: LGTM! Clean configuration structure with dataset-driven class count

The configuration properly uses `${dataset.num_classes}` to drive the model's output dimension from the dataset configuration, avoiding hardcoded values and ensuring consistency.
77-77: Consider whether num_classes=3 is the intended default

The default `num_classes: 3` seems unusual. Most standard datasets have different class counts (e.g., ImageNet has 1000, CIFAR-10 has 10). Verify if this is intentional or should be adjusted. Is `num_classes: 3` the correct default for the super-tiny-imagenet dataset mentioned in the PR description, or should this be 5 to match the "5 classes × 50 training images" mentioned?

recipes/vit/vit.py (3)
69-75: TransformerEngine import handling is properly defensive.

The try/except block correctly handles the optional TransformerEngine dependency and sets the appropriate flag for conditional usage throughout the code.
77-105: Model factory function handles edge cases well.

The `build_vit_model` function properly handles meta device initialization, TransformerEngine configuration, and channels_last memory format. The assertions and conditional logic are appropriate.
1264-1306: VisionTransformer constructor parameter validation is thorough.

The assertions for global_pool values, class_token requirements, and dimension divisibility provide good validation. The parameter documentation is comprehensive.
recipes/vit/README.md (1)
3-3: Fix TIMM capitalization.

Use "TIMM" (or "timm"), not "TImM".

```diff
-_Adapted ViT model code from huggingface/pytorch-image-models (TImM) written by Ross Wightman (@rwightman / Copyright 2020), which you can check out here: https://github.com/huggingface/pytorch-image-models_
+_Adapted ViT model code from huggingface/pytorch-image-models (TIMM) written by Ross Wightman (@rwightman / Copyright 2020), which you can check out here: https://github.com/huggingface/pytorch-image-models_
```

recipes/vit/beans.py (1)
42-55: Rank-0 logging looks good.

Init and metadata logging are correct and safe under DDP.
recipes/vit/config/vit_base_patch16_224.yaml (1)
77-86: num_classes wiring looks correct.

Model uses `${dataset.num_classes}`; `dataset.num_classes=3` matches Beans.
Also applies to: 10-10
Actionable comments posted: 0
♻️ Duplicate comments (7)
recipes/vit/test_infer.py (2)
28-38: Skip on CPU-only runners or support CPU.

The test forces CUDA usage, which will cause failures on CPU-only runners.
Apply this diff to guard the test:
```diff
+import torch
+
 @pytest.mark.parametrize("config_name", ["vit_base_patch16_224", "vit_te_base_patch16_224"])
+@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA")
 def test_infer(monkeypatch, tmp_path, config_name):
```
22-25: Imports rely on current working directory; make them package-relative.

The imports use bare module names, which will break when pytest runs from the project root. Additionally, the import from `infer` pulls in `main`, which is decorated with `@hydra.main` and expects zero args.

Apply this diff to fix the import and calling issues:

```diff
-from checkpoint import save_dcp_checkpoint
-from distributed import initialize_distributed
-from infer import main
-from vit import build_vit_model
+from .checkpoint import save_dcp_checkpoint
+from .distributed import initialize_distributed
+from .infer import run_inference  # assuming run_inference is extracted from main
+from .vit import build_vit_model
```

Also, Line 65 calls `main(vit_config)`, but `main` is decorated with `@hydra.main` and expects zero arguments. The inference logic should be extracted into an undecorated `run_inference(cfg)` function.

Check if the `run_inference` function exists or if `infer.py` needs refactoring:

```bash
#!/bin/bash
# Check if run_inference function exists in infer.py
rg -n "def run_inference" recipes/vit/infer.py || echo "run_inference function not found"
rg -n "@hydra.main" recipes/vit/infer.py -A 3
```

recipes/vit/README.md (2)
26-31: Remove incorrect PIP_CONSTRAINT guidance; recommend venv instead.

The guidance suggesting `PIP_CONSTRAINT=` to ignore dependency conflicts is incorrect. `PIP_CONSTRAINT` requires a constraints file path, and setting it empty doesn't "ignore conflicts."

Apply this diff to fix the installation guidance:

````diff
-```
+```bash
 cd recipes/vit
-# If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
-# We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
+# Recommended: create and activate a clean virtual environment (or use the provided Docker image)
+python -m venv .venv && source .venv/bin/activate
 pip install -r requirements.txt
 ```
````

41-41: Fix PyTorch DCP docs URL (domain is pytorch.org, not docs.pytorch.org).

The PyTorch documentation links point to a non-canonical host. Apply this diff to fix the URLs throughout the document:

```diff
-which will train on the [`AI-Lab-Makerere/ibean`](https://github.com/AI-Lab-Makerere/ibean/) (HuggingFace: [`AI-Lab-Makerere/beans`](https://huggingface.co/datasets/AI-Lab-Makerere/beans)) dataset and save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
+which will train on the [`AI-Lab-Makerere/ibean`](https://github.com/AI-Lab-Makerere/ibean/) (HuggingFace: [`AI-Lab-Makerere/beans`](https://huggingface.co/datasets/AI-Lab-Makerere/beans)) dataset and save auto-resumable [PyTorch Distributed Checkpoint (DCP)](https://pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
```

Also fix at Lines 62-63 and 76-77:

```diff
-  Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
+  Docs: https://pytorch.org/docs/stable/distributed.checkpoint.html
```

recipes/vit/train.py (1)
183-187: Fix dataloader resume logic to avoid duplicating/skipping samples.

Passing `start=latest_step_idx` to `enumerate` only shifts the batch index; it doesn't advance the iterator. This means that, when resuming from a checkpoint, the model will re-process samples that were already seen before the checkpoint.

The current implementation can lead to:
- Data duplication when resuming training
- Non-deterministic training results
- Incorrect learning curves
Apply this diff to fix the dataloader resume logic:
```diff
+    resume_epoch = latest_step_idx // steps_per_epoch if steps_per_epoch > 0 else 0
+    train_sampler.set_epoch(resume_epoch)
+
+    # Create iterator and skip already processed batches
+    dataloader_iter = iter(infinite_dataloader(train_dataloader, train_sampler))
+    for _ in range(latest_step_idx):
+        next(dataloader_iter, None)
+
     for batch_idx, sample in enumerate(
-        # Skip to latest step.
-        infinite_dataloader(train_dataloader, train_sampler),
+        dataloader_iter,
         start=latest_step_idx,
     ):
```
18-21: Guard runtime deps (CUDA/TE) and prep for ephemeral port use: add imports.

To support the skip guards and a free port, add these imports.

```diff
 from pathlib import Path

+import importlib.util
+import socket
+import torch
 import pytest
 from hydra import compose, initialize_config_dir
```
30-35: Skip when CUDA/TE unavailable and avoid hardcoded port.

Prevents spurious CI failures and port collisions on shared runners.

```diff
-    # Set required environment variables for distributed training
+    # Runtime guards
+    if not torch.cuda.is_available():
+        pytest.skip("CUDA is required for this test.")
+    if config_name.startswith("vit_te_") and importlib.util.find_spec("transformer_engine") is None:
+        pytest.skip("TransformerEngine not installed; skipping TE config.")
+
+    # Set required environment variables for distributed training
     monkeypatch.setenv("LOCAL_RANK", "0")
     monkeypatch.setenv("RANK", "0")
     monkeypatch.setenv("WORLD_SIZE", "1")
-    monkeypatch.setenv("MASTER_ADDR", "localhost")
-    monkeypatch.setenv("MASTER_PORT", "29500")
+    monkeypatch.setenv("MASTER_ADDR", "127.0.0.1")
+    with socket.socket() as s:
+        s.bind(("127.0.0.1", 0))
+        monkeypatch.setenv("MASTER_PORT", str(s.getsockname()[1]))
```
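The ephemeral-port trick in the diff above can be factored into a small stdlib helper; `find_free_port` is a hypothetical name used here for illustration:

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Bind to port 0 so the OS assigns a free ephemeral port, then release it."""
    with socket.socket() as s:
        s.bind((host, 0))
        return s.getsockname()[1]


port = find_free_port()
print(port)  # an OS-assigned port, e.g. in the ephemeral range
```

Note there is a small race window between releasing the socket and the trainer rebinding the port, which is generally acceptable for tests but worth knowing about.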
🧹 Nitpick comments (5)
recipes/vit/README.md (1)
11-20: Add language specification to fenced code blocks.

The markdown linter indicates several fenced code blocks are missing language specifications.

Apply this diff to add language specifications:

````diff
-```
+```bash
 docker build -t <image_repo>:<image_tag> .
 ```

 To launch a Docker container from the image, run the following command:

-```
+```bash
 # Utilize plenty of shared memory (--shm-size) to support loading large batches of image data!
 docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
 ```
````

Also apply to Lines 37-37, 105-105, and 121-121 by adding the `bash` language specification.

recipes/vit/test_train.py (4)
41-52: Make dataloading hermetic for tests (no worker procs).

Avoids persistent_workers and multiprocessing overhead/flakiness in CI.

```diff
 overrides=[
     "++training.steps=5",
     "++training.val_interval=5",
     "++training.log_interval=1",
     f"++training.checkpoint.path={training_ckpt_path}",
     "++profiling.torch_memory_profile=false",
     "++profiling.wandb=false",
     f"++fsdp.init_model_with_meta_device={init_model_with_meta_device}",
+    "++dataset.num_workers=0",
 ],
```
56-63: Loosen checkpoint assertion and verify auto-resume is idempotent.

Exact directory count can be config/impl-dependent; assert ≥1 and ensure the second run doesn't create extras.

```diff
-    # Verify checkpoints were created.
-    assert sum(1 for item in training_ckpt_path.iterdir() if item.is_dir()) == 1, (
-        "Expected 1 checkpoint with 5 training steps and validation interval of 5."
-    )
-
-    # Auto-resume training from checkpoint. For this test, we auto-resume from the best checkpoint.
-    main(vit_config)
+    # Verify checkpoints were created.
+    num_ckpts_before = sum(1 for item in training_ckpt_path.iterdir() if item.is_dir())
+    assert num_ckpts_before >= 1, f"Expected at least 1 checkpoint; found {num_ckpts_before}."
+
+    # Auto-resume training from checkpoint. For this test, resume from the best/latest checkpoint.
+    main(vit_config)
+
+    # Ensure auto-resume doesn't create additional checkpoint directories.
+    num_ckpts_after = sum(1 for item in training_ckpt_path.iterdir() if item.is_dir())
+    assert num_ckpts_after == num_ckpts_before, "Auto-resume should not create extra checkpoint dirs."
```
21-21: Import path robustness for train.main.

Depending on how pytest sets sys.path, `from train import main` can resolve the wrong module. Consider a more explicit import (package path) or file-scoped import via importlib. Optional if CI is stable.

```diff
-from train import main
+# Option A (package import): from recipes.vit.train import main
+# Option B (file import inside test):
+# spec = importlib.util.spec_from_file_location("vit_train", Path(__file__).parent / "train.py")
+# mod = importlib.util.module_from_spec(spec); spec.loader.exec_module(mod); main = mod.main
```
1-63: Optional: add a smoke 2-GPU test gated by availability.

Helps catch regressions in DP=2 without burdening CI on 1-GPU runners.

```python
@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="Requires >=2 GPUs")
def test_train_dp2_smoke(tmp_path, monkeypatch):
    monkeypatch.setenv("LOCAL_RANK", "0")
    monkeypatch.setenv("RANK", "0")
    monkeypatch.setenv("WORLD_SIZE", "2")
    monkeypatch.setenv("MASTER_ADDR", "127.0.0.1")
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        monkeypatch.setenv("MASTER_PORT", str(s.getsockname()[1]))
    # Compose config with dp_shard=2 and very small work.
    # (Reuse compose/initialize_config_dir as above, set steps=2, val_interval=2, num_workers=0.)
    # Call main(cfg) and assert it completes without exception.
```

Want me to push a patch adding this guarded smoke test?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- recipes/vit/.gitignore (1 hunks)
- recipes/vit/.ruff.toml (1 hunks)
- recipes/vit/Dockerfile (1 hunks)
- recipes/vit/LICENSE (1 hunks)
- recipes/vit/README.md (1 hunks)
- recipes/vit/beans.py (1 hunks)
- recipes/vit/checkpoint.py (1 hunks)
- recipes/vit/config/defaults.yaml (1 hunks)
- recipes/vit/config/vit_base_patch16_224.yaml (1 hunks)
- recipes/vit/config/vit_te_base_patch16_224.yaml (1 hunks)
- recipes/vit/distributed.py (1 hunks)
- recipes/vit/infer.py (1 hunks)
- recipes/vit/requirements.txt (1 hunks)
- recipes/vit/test_infer.py (1 hunks)
- recipes/vit/test_train.py (1 hunks)
- recipes/vit/train.py (1 hunks)
- recipes/vit/vit.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
- recipes/vit/requirements.txt
- recipes/vit/config/defaults.yaml
- recipes/vit/.ruff.toml
- recipes/vit/beans.py
- recipes/vit/LICENSE
- recipes/vit/Dockerfile
- recipes/vit/config/vit_te_base_patch16_224.yaml
- recipes/vit/distributed.py
- recipes/vit/config/vit_base_patch16_224.yaml
- recipes/vit/.gitignore
- recipes/vit/infer.py
- recipes/vit/vit.py
- recipes/vit/checkpoint.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-09T23:57:34.120Z
Learnt from: pstjohn
PR: NVIDIA/bionemo-framework#1139
File: models/geneformer/src/geneformer/convert.py:1-3
Timestamp: 2025-09-09T23:57:34.120Z
Learning: NVIDIA's legal team has specified using "LicenseRef-Apache2" as the SPDX license identifier instead of the standard "Apache-2.0" identifier in their codebase header files.
Applied to files:
recipes/vit/test_infer.py
🧬 Code graph analysis (3)
recipes/vit/test_infer.py (4)
recipes/vit/checkpoint.py (1)
- save_dcp_checkpoint (120-131)

recipes/vit/distributed.py (1)
- initialize_distributed (23-85)

recipes/vit/infer.py (1)
- main (30-51)

recipes/vit/vit.py (1)
- build_vit_model (77-105)

recipes/vit/test_train.py (1)

recipes/vit/train.py (1)
- main (42-307)

recipes/vit/train.py (4)

recipes/vit/beans.py (2)
- BeansDataset (37-65)
- infinite_dataloader (27-34)

recipes/vit/checkpoint.py (2)
- load_auto_resume_checkpoint (75-117)
- save_auto_resumable_checkpoint (134-165)

recipes/vit/distributed.py (1)
- initialize_distributed (23-85)

recipes/vit/vit.py (1)
- build_vit_model (77-105)
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
105-105: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
121-121: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
159-159: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests (recipes/vit, svcbionemo023/bionemo-framework:pytorch25.06-py3-squashed-zstd)
🔇 Additional comments (7)
recipes/vit/train.py (3)
41-112: LGTM!

The training setup is comprehensive and well-configured:
- Proper Hydra integration with clear configuration
- Comprehensive profiling setup (memory profiling and WandB)
- Correct device mesh initialization and FSDP configuration
- Proper auto-resume checkpoint loading
The detailed comments explaining the FSDP parameters are particularly helpful.
218-248: LGTM!

The validation logic is well-implemented:
- Proper model evaluation mode setting
- Distributed loss reduction across the HSDP group
- Appropriate checkpoint saving after validation
- Clean validation loop structure
294-307: LGTM!

The cleanup logic is appropriate:
- Proper memory profiler snapshot dumping with TODO for modernization
- Correct WandB finalization on main process only
- Clean resource management
recipes/vit/test_infer.py (2)
2-2: Use NVIDIA-approved SPDX license identifier.

The retrieved learning confirms that NVIDIA's legal team has specified using "LicenseRef-Apache2" as the SPDX license identifier in their codebase header files. This is correct as implemented.
55-62: LGTM!

The checkpoint creation and conversion logic is well-structured:
- Proper distributed context management
- Correct model initialization with device mesh
- Clean DCP to torch checkpoint conversion
- Appropriate use of temporary paths
recipes/vit/README.md (1)
1-4: LGTM!

The header section is well-structured with proper attribution to the original TIMM work by Ross Wightman and clear acknowledgment of the adaptation.
recipes/vit/test_train.py (1)
24-26: Nice parametrization and coverage of meta-device init.

Good breadth for the resume path and config variants.
Actionable comments posted: 3
🧹 Nitpick comments (5)
.github/workflows/unit-tests-recipes.yml (1)
134-134: Parameterize shm size for future tuning.

Minor: make 16G configurable without editing YAML.

```diff
 unit-tests:
   needs: changed-dirs
   runs-on: linux-amd64-gpu-l4-latest-1
@@
-  container:
+  env:
+    CI_SHM_SIZE: 16G
+  container:
     image: ${{ matrix.recipe.image }}
-    options: --shm-size=16G
+    options: --shm-size=${{ env.CI_SHM_SIZE }}
```

LICENSE/third_party.txt (4)
874-875: Standardize "License Text:" URL + add upstream commit references.

For consistency and auditability, add "License Text:" lines pointing to the upstream LICENSE file and an "Upstream ref:" line with the exact commit SHA used.

Apply example edits:

```diff
-https://github.com/huggingface/pytorch-image-models
+License Text: https://github.com/huggingface/pytorch-image-models/blob/main/LICENSE
+Upstream ref: https://github.com/huggingface/pytorch-image-models/commit/<sha>

-https://github.com/google-research/vision_transformer
-https://github.com/google-research/big_vision
+License Text: https://github.com/google-research/vision_transformer/blob/main/LICENSE
+License Text: https://github.com/google-research/big_vision/blob/main/LICENSE
+Upstream ref: https://github.com/google-research/vision_transformer/commit/<sha>, https://github.com/google-research/big_vision/commit/<sha>

-https://github.com/AI-Lab-Makerere/ibean
+License Text: https://github.com/AI-Lab-Makerere/ibean/blob/master/LICENSE
+Upstream ref: https://github.com/AI-Lab-Makerere/ibean/commit/<sha>

-https://github.com/lucidrains/vit-pytorch
+License Text: https://github.com/lucidrains/vit-pytorch/blob/main/LICENSE
+Upstream ref: https://github.com/lucidrains/vit-pytorch/commit/<sha>

-https://github.com/karpathy/minGPT
+License Text: https://github.com/karpathy/minGPT/blob/master/LICENSE
+Upstream ref: https://github.com/karpathy/minGPT/commit/<sha>
```

Also applies to: 1079-1082, 1286-1290, 1311-1313, 1336-1338
874-1285: Reduce redundant Apache-2.0 full-text blocks.

Multiple full Apache texts bloat the file. Keep one canonical Apache-2.0 text and list all covered projects beneath it, unless legal requires embedding each upstream's exact block. If you keep duplicates, at least drop the Appendix sections to avoid placeholder confusion.
1056-1076: Appendix placeholders can confuse consumers.

These "[yyyy]/[name of copyright owner]" and "{}" placeholder appendices are boilerplate for authors, not third-party attribution. Trim them to prevent misinterpretation.
Also applies to: 1260-1284
874-1346: Add automated license scanning to CI for drift and completeness.

Integrate a lightweight step (e.g., scancode-toolkit, pip-licenses, or reuse) to ensure all transitive deps in recipes/vit/requirements.txt and Docker images are represented here, and to catch future drift.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
.github/workflows/unit-tests-recipes.yml(1 hunks)LICENSE/third_party.txt(1 hunks)recipes/vit/LICENSE(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- recipes/vit/LICENSE
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (3)
.github/workflows/unit-tests-recipes.yml (1)
134-134: Good call increasing /dev/shm for ViT/TE tests.

This should mitigate DataLoader multiprocessing and NCCL shared-mem pressure in containerized runs.
To be safe on self‑hosted GPU runners, confirm Docker honors this flag (some setups inject their own runopts). You can add a quick check step:
```diff
       - name: Show GPU info
         run: nvidia-smi
+      - name: Verify container /dev/shm size
+        run: |
+          echo "Expecting /dev/shm ~= 16G"
+          df -h /dev/shm || true
```

LICENSE/third_party.txt (2)
1336-1346: Keep minGPT license entry; attribution exists in recipes/vit/vit.py.
recipes/vit/vit.py contains "inspired by Andrej Karpathy's minGPT" (≈line 36), so retain the LICENSE/third_party.txt entry; add per-file header/NOTICE only if you substantially adapted code.
1286-1310: Clarify Beans dataset license — Hugging Face card is blank; iBean repo is MIT.
- Finding: Hugging Face dataset card for AI-Lab-Makerere/beans lists no license; the iBean GitHub repository LICENSE is MIT.
- Action (file: LICENSE/third_party.txt, lines 1286–1310): either replace/supplement the MIT block with the dataset's actual license or add an explicit note that the Hugging Face dataset has no license listed and the MIT text refers only to the iBean repo code. Confirm the dataset license with the dataset owner if unclear.
cf087af to
ef8f31d
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
recipes/vit/vit.py (2)
515-524: Bias concatenation order is wrong (breaks attention/MLP); also avoid registering None buffers.
`self.in_split` is `[mlp_hidden_dim, dim, dim, dim]`, but the bias is concatenated as `(qkv_bias, mlp_bias)`. This assigns the MLP bias to QKV and a zero bias to the MLP — functional bug. Also, `register_buffer(..., None)` / `register_parameter(..., None)` is unnecessary here. Apply this diff:
```diff
         self.in_split = [mlp_hidden_dim] + [dim] * 3
-        if qkv_bias:
-            self.register_buffer("qkv_bias", None)
-            self.register_parameter("mlp_bias", None)
+        if qkv_bias:
+            # Use in-proj's bias for both MLP and QKV when qkv_bias=True
+            self.qkv_bias = None
+            self.mlp_bias = None
         else:
             self.register_buffer("qkv_bias", torch.zeros(3 * dim), persistent=False)
             self.mlp_bias = nn.Parameter(torch.zeros(mlp_hidden_dim))
@@
         if self.mlp_bias is not None:
             # Concat constant zero-bias for qkv w/ trainable mlp_bias.
             # Appears faster than adding to x_mlp separately
-            y = F.linear(y, self.in_proj.weight, torch.cat((self.qkv_bias, self.mlp_bias)))
+            # Order must match self.in_split = [mlp, q, k, v]
+            y = F.linear(y, self.in_proj.weight, torch.cat((self.mlp_bias, self.qkv_bias)))
         else:
             y = self.in_proj(y)
```

Also applies to: 540-549, 542-547
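To see concretely why the concatenation must mirror the split order, here is a standalone sketch with toy tensors (not the recipe's module; shapes are illustrative) comparing the fused projection against an unfused reference for the MLP branch:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, mlp_hidden = 4, 8
x = torch.randn(2, dim)
weight = torch.randn(mlp_hidden + 3 * dim, dim)  # fused rows: [mlp | q | k | v]
mlp_bias = torch.randn(mlp_hidden)
qkv_bias = torch.zeros(3 * dim)

# The bias concat must mirror the fused weight's row order (mlp first),
# which is also the order of in_split = [mlp_hidden, dim, dim, dim].
y = F.linear(x, weight, torch.cat((mlp_bias, qkv_bias)))
x_mlp, q, k, v = torch.split(y, [mlp_hidden, dim, dim, dim], dim=-1)

# Reference: unfused MLP branch computed directly from the first rows.
ref = F.linear(x, weight[:mlp_hidden], mlp_bias)
```

Concatenating `(qkv_bias, mlp_bias)` instead would shift the trainable bias onto the q/k/v rows, which is exactly the bug flagged above.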
1593-1596: Guard pos_embed when disabled.
When `pos_embed=False`, `self.pos_embed` is `None`; the current calls will crash. Apply this diff:
```diff
-        x = self.pos_embed(x)
+        if self.pos_embed is not None:
+            x = self.pos_embed(x)
@@
-        x = self.pos_embed(x)
+        if self.pos_embed is not None:
+            x = self.pos_embed(x)
```

Also applies to: 1677-1680
🧹 Nitpick comments (2)
recipes/vit/vit.py (2)
132-138: Fail fast when TE is requested but not installed.
Currently it silently falls back to non‑TE. Raise a clear error.
Apply this diff:
```diff
     if cfg.model.transformer_engine and _TE_INSTALLED:
         assert device_mesh is not None, "[build_model] device_mesh is required when using TransformerEngine."
         vit_kwargs["block_fn"] = TransformerLayer
         vit_kwargs["micro_batch_size"] = cfg.dataset.train.batch_size
         vit_kwargs["tp_group"] = device_mesh["tp"].get_group()
         vit_kwargs["tp_size"] = device_mesh["tp"].size()
+    elif cfg.model.transformer_engine and not _TE_INSTALLED:
+        raise ImportError("TransformerEngine was requested (cfg.model.transformer_engine=True) but is not installed.")
```
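The fail-fast pattern in isolation, as a hedged sketch (the `resolve_block_fn` helper is hypothetical, not part of the recipe; only the guarded-import idiom is the point):

```python
try:
    from transformer_engine.pytorch import TransformerLayer  # optional dependency
    _TE_INSTALLED = True
except ImportError:
    TransformerLayer = None
    _TE_INSTALLED = False


def resolve_block_fn(use_transformer_engine: bool):
    """Pick the transformer block implementation, failing loudly when TE
    was requested but is missing instead of silently falling back."""
    if use_transformer_engine:
        if not _TE_INSTALLED:
            raise ImportError(
                "TransformerEngine was requested but is not installed."
            )
        return TransformerLayer
    return "default-block"
```

The user gets an actionable `ImportError` at model-build time rather than a silently different (and slower-to-debug) model.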
139-141: Avoid channels_last on meta.
Calling `.to(memory_format=...)` on meta parameters is risky. Gate it on `not meta_init` (see earlier diff).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- LICENSE/third_party.txt (1 hunks)
- recipes/vit/LICENSE (1 hunks)
- recipes/vit/beans.py (1 hunks)
- recipes/vit/vit.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- recipes/vit/beans.py
- recipes/vit/LICENSE
- LICENSE/third_party.txt
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: cspades
PR: NVIDIA/bionemo-framework#1103
File: LICENSE/third_party.txt:1079-1285
Timestamp: 2025-09-16T19:36:07.174Z
Learning: The recipes/vit/vit.py file already contains comprehensive attribution for Google ViT research code, including paper references, upstream repository links, and proper acknowledgments of the original authors.
📚 Learning: 2025-09-16T19:36:07.174Z
Learnt from: cspades
PR: NVIDIA/bionemo-framework#1103
File: LICENSE/third_party.txt:1079-1285
Timestamp: 2025-09-16T19:36:07.174Z
Learning: The recipes/vit/vit.py file already contains comprehensive attribution for Google ViT research code, including paper references, upstream repository links, and proper acknowledgments of the original authors.
Applied to files:
recipes/vit/vit.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (4)
recipes/vit/vit.py (4)
16-80: Attribution and licensing look correct.
Apache-2.0 and MIT attributions for timm, Phil Wang, and Karpathy are clearly included. Matches our third_party entry.
1510-1516: TE internal attribute access is brittle across versions.
`self_attention.proj.weight` and `layernorm_mlp.fc2_weight` may not exist in all TE releases. Guard with `hasattr` or gate behind a version check. Apply this diff:
```diff
-        if _TE_INSTALLED and self.block_fn == TransformerLayer:
-            rescale(layer.self_attention.proj.weight.data, layer_id + 1)
-            rescale(layer.layernorm_mlp.fc2_weight.data, layer_id + 1)
+        if _TE_INSTALLED and self.block_fn == TransformerLayer:
+            if hasattr(layer, "self_attention") and hasattr(layer.self_attention, "proj"):
+                rescale(layer.self_attention.proj.weight.data, layer_id + 1)
+            if hasattr(layer, "layernorm_mlp") and hasattr(layer.layernorm_mlp, "fc2_weight"):
+                rescale(layer.layernorm_mlp.fc2_weight.data, layer_id + 1)
```

If you want, I can script-check the installed TE version and symbols across CI runners.
106-112: Defensive TE import is fine.
Short-circuiting on `_TE_INSTALLED` prevents a `NameError` when `TransformerLayer` isn't present.
125-141: Meta init context is invalid; torch.device is not a context manager.
Using `with torch.device("meta")` will raise at runtime. Switch to `torch.set_default_device("meta")` with a try/finally, and avoid channels_last on meta. Apply this diff:
```diff
-    with (
-        # Meta Device Initialization
-        torch.device("meta") if meta_init else nullcontext()
-    ):
-        vit_kwargs = dict(cfg.model.vit)
-        if meta_init:
-            vit_kwargs["weight_init"] = None
-        if cfg.model.transformer_engine and _TE_INSTALLED:
-            assert device_mesh is not None, "[build_model] device_mesh is required when using TransformerEngine."
-            vit_kwargs["block_fn"] = TransformerLayer
-            vit_kwargs["micro_batch_size"] = cfg.dataset.train.batch_size
-            vit_kwargs["tp_group"] = device_mesh["tp"].get_group()
-            vit_kwargs["tp_size"] = device_mesh["tp"].size()
-        model = VisionTransformer(**vit_kwargs)
-        if cfg.model.channels_last:
-            model.to(memory_format=torch.channels_last)
+    if meta_init:
+        torch.set_default_device("meta")
+    try:
+        vit_kwargs = dict(cfg.model.vit)
+        if meta_init:
+            vit_kwargs["weight_init"] = None
+        if cfg.model.transformer_engine and _TE_INSTALLED:
+            assert device_mesh is not None, "[build_model] device_mesh is required when using TransformerEngine."
+            vit_kwargs["block_fn"] = TransformerLayer
+            vit_kwargs["micro_batch_size"] = cfg.dataset.train.batch_size
+            vit_kwargs["tp_group"] = device_mesh["tp"].get_group()
+            vit_kwargs["tp_size"] = device_mesh["tp"].size()
+        model = VisionTransformer(**vit_kwargs)
+        if cfg.model.channels_last and not meta_init:
+            model.to(memory_format=torch.channels_last)
+    finally:
+        if meta_init:
+            torch.set_default_device("cpu")
```

Likely an incorrect or invalid review comment.
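A quick runnable check consistent with that "likely incorrect" verdict (assuming PyTorch 2.0 or newer, where `torch.device` gained context-manager support):

```python
import torch

# In PyTorch >= 2.0, torch.device instances work as context managers:
# tensors constructed inside the block default to that device.
with torch.device("meta"):
    t = torch.empty(4, 4)

assert t.is_meta  # allocated on the meta device, no real storage
```

So the `with torch.device("meta") if meta_init else nullcontext()` pattern in the original code is valid on recent PyTorch; only the channels_last-on-meta concern may still apply.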
Signed-off-by: Cory Ye <cye@nvidia.com>
…cripts and tests using Torch.save checkpoint loading. Signed-off-by: Cory Ye <cye@nvidia.com>
ef8f31d to
260c5a5
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (4)
recipes/vit/README.md (3)
26-31: Replace PIP_CONSTRAINT tip with venv setup.
Constraints can’t be “emptied” via env var; recommend a clean venv instead.
````diff
-```bash
-cd recipes/vit
-# If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
-# We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
-pip install -r requirements.txt
-```
+```bash
+cd recipes/vit
+# Recommended: create and activate a clean virtual environment (or use the provided Docker image)
+python -m venv .venv && source .venv/bin/activate
+pip install -r requirements.txt
+```
````
41-41: Fix PyTorch DCP links to canonical domain.
Use pytorch.org, not docs.pytorch.org.
```diff
-which will train on a local tiny 5-class version of [ImageNet](https://image-net.org/) ([super-tiny-imagenet-5](./data/super-tiny-imagenet-5/)) and save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
+which will train on a local tiny 5-class version of [ImageNet](https://image-net.org/) ([super-tiny-imagenet-5](./data/super-tiny-imagenet-5/)) and save auto-resumable [PyTorch Distributed Checkpoint (DCP)](https://pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
@@
-  Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
+  Docs: https://pytorch.org/docs/stable/distributed.checkpoint.html
@@
-  Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
+  Docs: https://pytorch.org/docs/stable/distributed.checkpoint.html
```

Also applies to: 62-63, 76-77
55-57: Add missing torch import in the Python snippet.
Prevents NameError and makes the example runnable.
````diff
 ```python
+import torch
````

recipes/vit/checkpoint.py (1)
155-165: Avoid world‑writable 0777 on checkpoints; prefer 0o775 (or respect umask).
0777 is risky on shared filesystems. Tighten perms; keep the rank‑0 logging guard.
```diff
-    # Relax checkpoint permissions, which may be helpful when saving checkpoints in a container owned by root.
-    mode = 0o777
+    # Relax checkpoint permissions for collaboration; avoid world-writable.
+    mode = 0o775
     for dirpath, _, filenames in os.walk(ckpt_dir):
         # Change current directory perms.
         os.chmod(dirpath, mode)
         for filename in filenames:
             # Change file perms.
             file_path = Path(dirpath) / filename
             os.chmod(file_path, mode)
     if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
         _logger.info(f"Saved validated checkpoint to: {ckpt_dir}")
```
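The permission change is easy to sanity-check in isolation (a standalone sketch using a temp directory; `0o775` clears the world-write bit that `0o777` sets, since `os.chmod` applies the mode exactly, without umask interaction):

```python
import os
import stat
import tempfile

# 0o777 grants world write; 0o775 keeps owner/group rwx but drops it.
with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = os.path.join(tmp, "step_75_loss_1.725")
    os.mkdir(ckpt_dir)
    os.chmod(ckpt_dir, 0o775)
    mode = stat.S_IMODE(os.stat(ckpt_dir).st_mode)

assert mode == 0o775
assert not (mode & stat.S_IWOTH)  # no world-write bit
```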
🧹 Nitpick comments (11)
recipes/vit/beans.py (2)
49-56: Make sampler.set_epoch() optional to avoid AttributeError with non-Distributed samplers.
SequentialSampler and many custom samplers don’t implement set_epoch. Guard the call.
```diff
 def infinite_dataloader(dataloader, sampler):
     """Create an infinite iterator that automatically restarts at the end of each epoch."""
     epoch = 0
     while True:
-        sampler.set_epoch(epoch)  # Update epoch for proper shuffling
+        # Update epoch for proper shuffling when supported (e.g., DistributedSampler)
+        if sampler is not None and hasattr(sampler, "set_epoch"):
+            sampler.set_epoch(epoch)
         for batch in dataloader:
             yield batch
         epoch += 1  # Increment epoch counter after completing one full pass
```
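The guarded variant can be exercised without a `DistributedSampler` (standalone sketch; the plain-list "loader" is a stand-in for a real DataLoader, and `sampler=None` mimics a SequentialSampler with no `set_epoch`):

```python
def infinite_dataloader(dataloader, sampler=None):
    """Yield batches forever; only call set_epoch when the sampler supports it."""
    epoch = 0
    while True:
        if sampler is not None and hasattr(sampler, "set_epoch"):
            sampler.set_epoch(epoch)
        for batch in dataloader:
            yield batch
        epoch += 1  # completed one full pass


# A plain list of batches stands in for a DataLoader here.
batches = [[0, 1], [2, 3]]
it = infinite_dataloader(batches, sampler=None)
first_four = [next(it) for _ in range(4)]  # crosses the epoch boundary safely
```

With the unguarded original, the same call would raise `AttributeError` on the first epoch.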
40-44: Specify interpolation when resizing to avoid PIL default (nearest) degrading image quality.
PIL.Image.resize defaults to nearest-neighbor. Use bicubic/bilinear for classification inputs.
```diff
 import torch
 from datasets import load_dataset
 from torch.utils.data import Dataset
 from torchvision.transforms.functional import to_tensor
+from PIL import Image
@@
-        image_tensor = to_tensor(sample["image"].resize(self.resize_dimensions).convert("RGB"))
+        img = sample["image"].convert("RGB").resize(self.resize_dimensions, resample=Image.BICUBIC)
+        image_tensor = to_tensor(img)
         label_idx = sample["labels"]
         return image_tensor, label_idx
```

Also applies to: 85-87
recipes/vit/checkpoint.py (5)
35-36: Fix PyTorch docs URLs to canonical domain.
Use pytorch.org for tutorials/docs.
```diff
-    Docs: https://docs.pytorch.org/tutorials/beginner/saving_loading_models.html
+    Docs: https://pytorch.org/tutorials/beginner/saving_loading_models.html
@@
-    Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
+    Docs: https://pytorch.org/docs/stable/distributed.checkpoint.html
@@
-    Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
+    Docs: https://pytorch.org/docs/stable/distributed.checkpoint.html
```

Also applies to: 60-61, 123-124
45-51: Broaden the 'module.' prefix warning check and avoid StopIteration on empty state dict.
Check all keys from the raw checkpoint dict; don’t rely on the first key.
```diff
-    first_key = next(iter(model_checkpoint))
-    if first_key.startswith("module.") and not megatron_fsdp:
+    if (not megatron_fsdp) and any(k.startswith("module.") for k in checkpoint["model"].keys()):
         _logger.warning(
-            f"Checkpoint state dictionary keys ({first_key}) may be prefixed "
+            "Checkpoint state dictionary keys may be prefixed "
             "with 'module.' if converted from a Megatron-FSDP DCP checkpoint."
             "Set megatron_fsdp=True to automatically strip the prefix."
         )
```
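The behavioral difference between `next(iter(...))` and `any(...)` is easy to verify standalone (the `warn_if_module_prefixed` helper is hypothetical, distilled from the diff's condition):

```python
def warn_if_module_prefixed(state_dict, megatron_fsdp=False):
    """True when a warning should fire: keys carry the 'module.' prefix
    but megatron_fsdp prefix-stripping is disabled."""
    # any() is False on an empty dict (no StopIteration, unlike
    # next(iter(...))) and inspects every key, not just the first.
    return (not megatron_fsdp) and any(k.startswith("module.") for k in state_dict)
```

Note that `next(iter({}))` raises `StopIteration`, so the original code would crash outright on an empty state dict.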
57-73: Use DCP with stateful objects directly; avoid redundant state_dict round‑trip.
Passing the model/optimizer objects is the intended API and avoids extra memory/copies.
```diff
 def load_dcp_checkpoint(checkpoint_path, model=None, optimizer=None):
@@
-    # Load model and optimizer checkpoints.
-    state_dict = {}
-    if model is not None:
-        state_dict["model"] = model.state_dict()
-    if optimizer is not None:
-        state_dict["optimizer"] = optimizer.state_dict()
-    torch.distributed.checkpoint.load(state_dict, checkpoint_id=checkpoint_path)
-    if model is not None:
-        model.load_state_dict(state_dict["model"])
-    if optimizer is not None:
-        optimizer.load_state_dict(state_dict["optimizer"])
+    # Load model and optimizer checkpoints.
+    state = {}
+    if model is not None:
+        state["model"] = model
+    if optimizer is not None:
+        state["optimizer"] = optimizer
+    torch.distributed.checkpoint.load(state, checkpoint_id=checkpoint_path)
```
120-131: Mirror the stateful-objects approach for DCP save.
This aligns save/load symmetry and leverages DCP’s integration with stateful objects.
```diff
 def save_dcp_checkpoint(checkpoint_path, model=None, optimizer=None):
@@
-    # Save model and optimizer checkpoints.
-    state_dict = {}
-    if model is not None:
-        state_dict["model"] = model.state_dict()
-    if optimizer is not None:
-        state_dict["optimizer"] = optimizer.state_dict()
-    torch.distributed.checkpoint.save(state_dict, checkpoint_id=checkpoint_path)
+    # Save model and optimizer checkpoints.
+    state = {}
+    if model is not None:
+        state["model"] = model
+    if optimizer is not None:
+        state["optimizer"] = optimizer
+    torch.distributed.checkpoint.save(state, checkpoint_id=checkpoint_path)
```
99-111: Harden auto‑resume checkpoint selection against malformed names.
Defensively parse “step_loss” directory names; skip dirs that don’t match; only use loss sorting for resume_from_metric in {“+”, “-”}.
```diff
-    opt_metric_coeff = 1 if cfg.training.checkpoint.resume_from_metric == "+" else -1
-    latest_subdir = max(
-        subdirs,
-        key=lambda x: (
-            opt_metric_coeff * float(x.name.split("_")[3])
-            if cfg.training.checkpoint.resume_from_metric
-            else 0,
-            x.stat().st_mtime,
-        ),
-    )
+    mode = str(getattr(cfg.training.checkpoint, "resume_from_metric", "")).strip()
+    opt_metric_coeff = 1 if mode == "+" else -1
+    def _score(p: Path):
+        parts = p.name.split("_")
+        if mode in {"+", "-"} and len(parts) >= 4 and parts[0] == "step" and parts[2] == "loss":
+            try:
+                return (opt_metric_coeff * float(parts[3]), p.stat().st_mtime)
+            except ValueError:
+                pass
+        return (0, p.stat().st_mtime)
+    latest_subdir = max(subdirs, key=_score)
```

recipes/vit/README.md (4)
3-3: Fix TIMM casing and replace bare URL with a markdown link.
Minor polish; avoids MD034 and improves readability.
```diff
-_Adapted ViT model code from huggingface/pytorch-image-models (TImM) written by Ross Wightman (@rwightman / Copyright 2020), which you can check out here: https://github.com/huggingface/pytorch-image-models_
+_Adapted ViT model code from [huggingface/pytorch-image-models (timm)](https://github.com/huggingface/pytorch-image-models) written by Ross Wightman (@rwightman, © 2020)._
```
11-13: Add languages to fenced code blocks and note NGPU env.
Improves linting (MD040) and makes examples copy‑pasteable.
````diff
-```
+```bash
 docker build -t <image_repo>:<image_tag> .
@@
-```
+```bash
 # Utilize plenty of shared memory (--shm-size) to support loading large batches of image data!
 docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
@@
-```
+```bash
 cd recipes/vit
 # If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
 # We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
 pip install -r requirements.txt
@@
-```
+```bash
+# set NGPU to the number of visible GPUs, e.g.: export NGPU=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
 torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
@@
-```
+```bash
 python -m torch.distributed.checkpoint.format_utils --help
 usage: format_utils.py [-h] {torch_to_dcp,dcp_to_torch} src dst
@@
-```
+```bash
 python -m torch.distributed.checkpoint.format_utils dcp_to_torch step_75_loss_1.725 torch_ckpt_test.pt
````

Also applies to: 17-21, 26-31, 37-39, 105-117, 121-123

---

`5-9`: **Normalize heading levels (MD001)**

Increment by one level at a time for cleaner structure.

```diff
-### Pre-Requisites
+## Pre-Requisites
@@
-#### Docker Container
+### Docker Container
@@
-#### PIP Install
+### PIP Install
```
159-159: Replace bare URL with a labeled link.
Minor docs polish.
```diff
-TODO(@cspades): For converting DCP directly to HuggingFace SafeTensors checkpoints, you can look into: https://pytorch.org/blog/huggingface-safetensors-support-in-pytorch-distributed-checkpointing/
+TODO(@cspades): For converting DCP directly to HuggingFace SafeTensors checkpoints, see the PyTorch blog: [HuggingFace `safetensors` support in PyTorch Distributed Checkpointing](https://pytorch.org/blog/huggingface-safetensors-support-in-pytorch-distributed-checkpointing/).
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
- .github/workflows/unit-tests-recipes.yml (1 hunks)
- LICENSE/third_party.txt (1 hunks)
- recipes/vit/.gitignore (1 hunks)
- recipes/vit/.ruff.toml (1 hunks)
- recipes/vit/Dockerfile (1 hunks)
- recipes/vit/LICENSE (1 hunks)
- recipes/vit/README.md (1 hunks)
- recipes/vit/beans.py (1 hunks)
- recipes/vit/checkpoint.py (1 hunks)
- recipes/vit/config/defaults.yaml (1 hunks)
- recipes/vit/config/vit_base_patch16_224.yaml (1 hunks)
- recipes/vit/config/vit_te_base_patch16_224.yaml (1 hunks)
- recipes/vit/distributed.py (1 hunks)
- recipes/vit/infer.py (1 hunks)
- recipes/vit/requirements.txt (1 hunks)
- recipes/vit/test_infer.py (1 hunks)
- recipes/vit/test_train.py (1 hunks)
- recipes/vit/train.py (1 hunks)
- recipes/vit/vit.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- recipes/vit/.gitignore
🚧 Files skipped from review as they are similar to previous changes (15)
- .github/workflows/unit-tests-recipes.yml
- recipes/vit/config/vit_base_patch16_224.yaml
- recipes/vit/LICENSE
- recipes/vit/Dockerfile
- recipes/vit/config/vit_te_base_patch16_224.yaml
- recipes/vit/.ruff.toml
- recipes/vit/test_infer.py
- recipes/vit/test_train.py
- recipes/vit/infer.py
- recipes/vit/distributed.py
- recipes/vit/train.py
- recipes/vit/requirements.txt
- recipes/vit/config/defaults.yaml
- LICENSE/third_party.txt
- recipes/vit/vit.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: cspades
PR: NVIDIA/bionemo-framework#1103
File: LICENSE/third_party.txt:1079-1285
Timestamp: 2025-09-16T19:36:07.174Z
Learning: The recipes/vit/vit.py file already contains comprehensive attribution for Google ViT research code, including paper references, upstream repository links, and proper acknowledgments of the original authors.
📚 Learning: 2025-09-16T19:36:07.174Z
Learnt from: cspades
PR: NVIDIA/bionemo-framework#1103
File: LICENSE/third_party.txt:1079-1285
Timestamp: 2025-09-16T19:36:07.174Z
Learning: The recipes/vit/vit.py file already contains comprehensive attribution for Google ViT research code, including paper references, upstream repository links, and proper acknowledgments of the original authors.
Applied to files:
recipes/vit/README.md
📚 Learning: 2025-09-09T23:57:34.120Z
Learnt from: pstjohn
PR: NVIDIA/bionemo-framework#1139
File: models/geneformer/src/geneformer/convert.py:1-3
Timestamp: 2025-09-09T23:57:34.120Z
Learning: NVIDIA's legal team has specified using "LicenseRef-Apache2" as the SPDX license identifier instead of the standard "Apache-2.0" identifier in their codebase header files.
Applied to files:
recipes/vit/checkpoint.py
🧬 Code graph analysis (1)
recipes/vit/beans.py (1)
sub-packages/bionemo-size-aware-batching/tests/bionemo/size_aware_batching/conftest.py (1)
sampler(92-93)
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
105-105: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
121-121: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
159-159: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (1)
recipes/vit/checkpoint.py (1)
2-2: SPDX identifier is correct for this repo.
LicenseRef-Apache2 is the expected identifier per NVIDIA policy; keep as-is.
Description
TODO
Usage
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:
Note
By default, the notebooks validation tests are skipped unless explicitly enabled.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.
automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
Comment `/ok to test` on the pull request to trigger CI. This will need to be done for each new commit.

Pre-submit Checklist
Summary by CodeRabbit
New Features
Documentation
Tests
Chores