Add a generic ViT model to BioNeMo Recipes. #1103
Conversation
Walkthrough

Adds a complete ViT recipe under recipes/vit: model implementation, Hydra configs (including TransformerEngine variants), distributed initializer, training/inference scripts, DCP↔Torch checkpoint utilities with auto-resume, dataset loader, tests, Dockerfile, per-recipe lint/.gitignore, and license additions.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Hydra as Hydra (configs)
    participant Train as train.py
    participant Dist as initialize_distributed
    participant Model as build_vit_model
    participant Data as BeansDataset/Sampler
    participant FSDP as Megatron‑FSDP
    participant Ckpt as checkpoint.py
    User->>Hydra: run training config
    Hydra-->>Train: cfg
    Train->>Dist: init(dp_outer, dp_shard, cp, tp)
    Dist-->>Train: DeviceMesh
    Train->>Model: build_vit_model(cfg, DeviceMesh)
    Model-->>Train: ViT model
    Train->>FSDP: wrap(model, cfg.fsdp)
    Train->>Data: create loaders
    Train->>Ckpt: load_auto_resume_checkpoint(cfg, model, optim)
    alt resumed
        Ckpt-->>Train: resumed step_idx
    else fresh
        Ckpt-->>Train: None
    end
    loop training
        Train->>FSDP: forward/backward/step
        Train->>Ckpt: save_auto_resumable_checkpoint(...)
    end
```
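The auto-resume branch in the diagram (load a checkpoint and get back a resumed step index, or None for a fresh run) can be sketched in plain Python. The `find_latest_checkpoint` helper and the `step_<n>` directory naming below are hypothetical stdlib-only illustrations, not the recipe's actual checkpoint.py API.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(ckpt_root: str) -> Optional[int]:
    """Return the highest step index among step_<n> checkpoint dirs, or None.

    Mirrors the alt resumed / fresh branch above: None means "start from
    scratch"; an integer means "resume from that step".
    """
    root = Path(ckpt_root)
    if not root.is_dir():
        return None  # fresh run: no checkpoint directory yet
    steps = []
    for child in root.iterdir():
        if child.is_dir() and child.name.startswith("step_"):
            suffix = child.name.removeprefix("step_")
            if suffix.isdigit():
                steps.append(int(suffix))
    return max(steps) if steps else None
```

In a real trainer, the returned step would then seed the training loop's starting iteration before the forward/backward/step loop begins.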
```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Hydra as Hydra (configs)
    participant Infer as infer.py
    participant Dist as initialize_distributed
    participant Model as build_vit_model
    participant Ckpt as load_torch_checkpoint
    User->>Hydra: run inference config
    Hydra-->>Infer: cfg
    Infer->>Dist: init(...)
    Dist-->>Infer: DeviceMesh
    Infer->>Model: build_vit_model(cfg, DeviceMesh)
    Infer->>Ckpt: load_torch_checkpoint(path, model, megatron_fsdp=cfg.inference.checkpoint.megatron_fsdp)
    Ckpt-->>Infer: model weights loaded
    Infer->>Model: forward(mock input)
    Model-->>Infer: logits
```
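The `megatron_fsdp` flag passed to load_torch_checkpoint implies two on-disk formats: a sharded DCP checkpoint (a directory) versus a single torch checkpoint file. A hypothetical stdlib-only dispatch sketch, not the recipe's actual loader:

```python
from pathlib import Path


def describe_checkpoint(path: str, megatron_fsdp: bool) -> str:
    """Classify how a checkpoint at `path` would be loaded (illustrative only).

    Distributed (DCP) checkpoints are directories of shards; plain torch
    checkpoints are single files. A real loader would call the matching
    load routine instead of returning a label.
    """
    p = Path(path)
    if megatron_fsdp:
        if not p.is_dir():
            raise ValueError(f"expected a DCP shard directory, got {path}")
        return "dcp-directory"
    if not p.is_file():
        raise ValueError(f"expected a single checkpoint file, got {path}")
    return "torch-file"
```

Validating the path shape up front like this gives a clearer error than letting the underlying loader fail on a mismatched format.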
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 14
🧹 Nitpick comments (21)
recipes/vit/.ruff.toml (2)
3-3: Per-file ignore pattern likely too broad/non-matching
Key "tokenizer_auto" may not match any file. Use a glob with extension.
```diff
-[lint]
-per-file-ignores = { "tokenizer_auto" = ["ALL"] }
+[lint]
+per-file-ignores = { "**/tokenizer_auto.py" = ["ALL"] }
```
4-4: Avoid blanket disabling of core lint rules (F/E).
Disabling F (pyflakes) and E (pycodestyle) hides real bugs (e.g., undefined names). Narrow the ignore set and rely on the repo root for global policy.
```diff
-ignore = ["RUF","D","N","E","PLW","PERF","C","F"]
+ignore = ["RUF","D","N","PLW","PERF","C"]
```

If you need to suppress specific errors in this recipe, prefer targeted per-file ignores.
recipes/vit/.gitignore (1)
1-7: Add common ML/artifact ignores to keep the recipe tree clean.
Optional additions that frequently show up during runs.
```diff
 *.tar*
 *.zip
 checkpoints/
 wandb/
 outputs/
 __pycache__/
 .ruff_cache/
+lightning_logs/
+.ipynb_checkpoints/
+events.out.tfevents*
+*.ckpt
+*.pt
+*.pth
+nvlogs/
+.cache/
```

recipes/vit/data/super-tiny-imagenet-5/train/n03404251/n03404251_boxes.txt (1)
1-20: Confirm schema and coordinate convention in README.
File appears to be TSV: filename, xmin, ymin, xmax, ymax within [0,63]. Add a short README in this folder stating schema, range, and whether boxes are inclusive/exclusive to avoid off-by-one issues downstream.
I can draft the README if you confirm the convention (e.g., COCO-style with xmax/ymax exclusive).
recipes/vit/data/super-tiny-imagenet-5/train/n09193705/n09193705_boxes.txt (1)
1-1: Prefer TSV consistently (single delimiter).
Ensure a single TAB delimiter across all lines/files; mixed spaces/tabs will break simple parsers.
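A strict parser makes both points above concrete: it rejects anything other than a single TAB delimiter and checks the [0, 63] coordinate range. This is an illustrative stdlib sketch assuming the TSV schema described in the comment (filename, xmin, ymin, xmax, ymax); `parse_boxes` and its signature are hypothetical, not part of the recipe.

```python
import csv
import io


def parse_boxes(tsv_text: str, image_size: int = 64):
    """Parse box lines: filename<TAB>xmin<TAB>ymin<TAB>xmax<TAB>ymax.

    Rejects rows that don't split into exactly 5 tab-separated fields
    (catching mixed space/tab delimiters) and coordinates outside
    [0, image_size - 1], assuming inclusive bounds.
    """
    boxes = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if len(row) != 5:
            raise ValueError(f"expected 5 tab-separated fields, got {row!r}")
        name, *coords = row
        xmin, ymin, xmax, ymax = map(int, coords)
        if not (0 <= xmin <= xmax <= image_size - 1
                and 0 <= ymin <= ymax <= image_size - 1):
            raise ValueError(f"box out of range for {name}: {coords}")
        boxes.append((name, xmin, ymin, xmax, ymax))
    return boxes
```

A space-delimited line collapses into a single field here and fails fast, which is exactly the failure mode the comment warns about.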
recipes/vit/Dockerfile (2)
3-6: Harden pip install and improve build portability.
- Use python -m pip and --no-cache-dir.
- Avoid relying solely on BuildKit binds for portability; fall back to copying requirements when not using BuildKit.
Apply:
```diff
-RUN --mount=type=secret,id=netrc,target=/root/.netrc \
-    --mount=type=cache,target=/root/.cache/pip \
-    --mount=type=bind,source=requirements.txt,target=/requirements.txt \
-    PIP_CONSTRAINT= pip install -r /requirements.txt
+RUN --mount=type=secret,id=netrc,target=/root/.netrc \
+    --mount=type=cache,target=/root/.cache/pip \
+    --mount=type=bind,source=requirements.txt,target=/requirements.txt \
+    PIP_CONSTRAINT= python -m pip install --no-cache-dir -r /requirements.txt
```

Optional portability fallback (outside BuildKit): copy requirements first to leverage layer cache, then install.
8-9: Reduce image size and avoid copying training data by default.
COPY . . will include the dataset into the image. Prefer a .dockerignore (or COPY only needed subdirs) and mount data at runtime.
Add recipes/vit/.dockerignore:
```
# Python/build
__pycache__/
*.pyc
.ruff_cache/
# VCS
.git
.gitignore
# Data and outputs
data/
outputs/
wandb/
checkpoints/
ckpts/
```

recipes/vit/README.md (7)
3-3: Avoid bare URL; use a Markdown link.
Conform to MD034 and improve readability.
```diff
-_Adapted ViT model code from huggingface/pytorch-image-models (TImM), which you can check out here: https://github.com/huggingface/pytorch-image-models_
+_Adapted ViT model code from [huggingface/pytorch-image-models (TIMM)](https://github.com/huggingface/pytorch-image-models)._
```
5-11: Fix heading level increment (H2 after H1).
Matches MD001.
```diff
-### Pre-Requisites
+## Pre-requisites
```
11-14: Annotate code fences with language.
Satisfy MD040.
````diff
-```
+```bash
 docker build -t <image_repo>:<image_tag> .
````

17-21: Annotate code fences with language.
Satisfy MD040.

````diff
-```
+```bash
 # Utilize plenty of shared memory (--shm-size) to support loading large batches of image data!
 docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
````

26-31: Annotate code fences and tighten guidance about PIP_CONSTRAINT.
Add bash and clarify optional env var usage.

````diff
-```
+```bash
 cd recipes/vit
-# If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
+# If dependency resolution fails, you can unset constraints via: PIP_CONSTRAINT= pip install ...
 # We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
 pip install -r requirements.txt
````

37-41: Annotate training command fence and align path/config examples.
Add bash and ensure checkpoint path aligns with YAML defaults.

````diff
-```
+```bash
 torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
````

If the TE config uses "./checkpoints/vit_te", either:
- change README example to match, or
- update vit_te_base_patch16_224.yaml to "./ckpts/vit" for consistency.

45-47: Fix broken/misleading link text.
The link text says imagenet_*.py but points to imagenet_dataset.py. Make both consistent.

```diff
-The TIMM-derived model code for the ViT can be found in [`vit.py`](vit.py), and data utilities for ImageNet can be found in [`imagenet_*.py`](imagenet_dataset.py).
+The TIMM-derived model code for the ViT is in [`vit.py`](vit.py), and ImageNet data utilities are in [`imagenet_dataset.py`](imagenet_dataset.py).
```

recipes/vit/config/vit_te_base_patch16_224.yaml (2)
9-12: Add trailing newline to satisfy linters/pipeline.
Fixes EOF warning and CI "End-of-File fixer modified the file".
```diff
 path: "./checkpoints/vit_te"
 resume_from_metric: "-" # + = Highest Metric (Score), - = Lowest Metric (Loss)
+
```
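The fix above is what pre-commit's end-of-file-fixer applies automatically; a minimal stdlib sketch of the same repair (the helper name is hypothetical):

```python
from pathlib import Path


def ensure_trailing_newline(path: str) -> bool:
    """Append a final newline if the file lacks one. Returns True if modified."""
    p = Path(path)
    data = p.read_bytes()
    if data and not data.endswith(b"\n"):
        p.write_bytes(data + b"\n")
        return True
    return False
```

Running it twice is idempotent: the second call reports no modification, which is the property CI relies on.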
9-12: Align checkpoint path with README or document divergence.
Either update README example to "./checkpoints/vit_te" when using this config or change this path to "./ckpts/vit" for consistency.
recipes/vit/config/vit_base_patch16_224.yaml (2)
11-24: Pooling/class_token/reg_tokens combo is unusual for a base ViT.
Using global_pool="map" with class_token=true and reg_tokens=8 increases compute and may not be intended for a tiny demo. Consider reverting to token pooling with no reg tokens, or disabling class_token when using MAP.
```diff
-  global_pool: "map"
-  class_token: true
-  reg_tokens: 8
+  global_pool: "token"
+  class_token: true
+  reg_tokens: 0
```

(Or keep MAP and set class_token: false.)
21-21: Init settings diverge from defaults; may impact reproducibility.
If not intentional, match defaults for stable behavior.
```diff
-  init_values: null
+  init_values: null # keep or set per variant
-  weight_init: null
-  init_variance_rescale: true
+  weight_init: "timm"
+  init_variance_rescale: false
```

Also applies to: 36-39
recipes/vit/config/defaults.yaml (3)
95-99: Boolean style is inconsistent (True vs true).
YAML accepts both, but consistency helps avoid parser surprises and diffs.
```diff
-  normalize: True
+  normalize: true
```
100-117: Transforms look reasonable; consider exposing common augmentation knobs at top level if you expect users to tweak them frequently.
Optional; keeps defaults uncluttered but improves discoverability.
1-4: Add a brief header comment with the expected override points (data_root, num_classes, TE toggle).
Improves UX for folks new to Hydra configs.
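One possible shape for that header comment; the override key names below are illustrative guesses based on the comment, not the recipe's actual config schema:

```yaml
# defaults.yaml — common Hydra override points (names illustrative):
#   dataset.data_root=/path/to/data   # ImageNet-style dataset root
#   model.num_classes=1000            # number of target classes
#   model.use_te=true                 # toggle TransformerEngine layers
```

A header like this lets newcomers discover the main knobs without reading every nested config group.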
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (288)
All 288 ignored files are JPEG images under recipes/vit/data/super-tiny-imagenet-5/ (train/<wnid>/images/*.JPEG and val/images/*.JPEG), each excluded by the !**/*.jpeg path filter.
by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1654.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1660.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1671.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1740.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1802.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1855.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1900.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1934.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_1998.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_202.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_2026.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_2066.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_217.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_2234.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_2313.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_235.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_2354.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_2423.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/val/images/val_2454.JPEGis excluded by!**/*.jpeg
📒 Files selected for processing (12)
- recipes/vit/.gitignore (1 hunks)
- recipes/vit/.ruff.toml (1 hunks)
- recipes/vit/Dockerfile (1 hunks)
- recipes/vit/README.md (1 hunks)
- recipes/vit/config/defaults.yaml (1 hunks)
- recipes/vit/config/vit_base_patch16_224.yaml (1 hunks)
- recipes/vit/config/vit_te_base_patch16_224.yaml (1 hunks)
- recipes/vit/data/super-tiny-imagenet-5/train/n03404251/n03404251_boxes.txt (1 hunks)
- recipes/vit/data/super-tiny-imagenet-5/train/n03976657/n03976657_boxes.txt (1 hunks)
- recipes/vit/data/super-tiny-imagenet-5/train/n04507155/n04507155_boxes.txt (1 hunks)
- recipes/vit/data/super-tiny-imagenet-5/train/n06596364/n06596364_boxes.txt (1 hunks)
- recipes/vit/data/super-tiny-imagenet-5/train/n09193705/n09193705_boxes.txt (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 YAMLlint (1.37.1)
recipes/vit/config/vit_te_base_patch16_224.yaml
[error] 12-12: no new line character at the end of file
(new-line-at-end-of-file)
🪛 GitHub Actions: BioNeMo Framework CI
recipes/vit/config/vit_te_base_patch16_224.yaml
[error] 1-1: End-of-File fixer modified the file. Re-run pre-commit to apply and commit changes.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (7)
recipes/vit/data/super-tiny-imagenet-5/train/n06596364/n06596364_boxes.txt (1)
1-500: Document bbox schema and inclusive bounds.
The dataset consistently uses (xmin, ymin, xmax, ymax) across all classes, with coordinates ranging from 0 to 63 inclusive for 64 × 64 images. Add a note in the dataset loader or README to state this explicitly. No normalization across files is needed and no degenerate boxes were found.

Likely an incorrect or invalid review comment.
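A minimal stdlib sketch of such a loader-side check, assuming the usual Tiny-ImageNet tab-separated `filename xmin ymin xmax ymax` layout (the helper name is illustrative):

```python
# Minimal sketch: parse one Tiny-ImageNet-style boxes.txt line and validate
# the (xmin, ymin, xmax, ymax) schema with inclusive 0..63 bounds for 64x64 images.

def parse_box_line(line: str, img_size: int = 64) -> tuple[str, int, int, int, int]:
    name, *coords = line.split()
    xmin, ymin, xmax, ymax = map(int, coords)
    # Inclusive pixel-index bounds: 0 <= min <= max <= img_size - 1.
    if not (0 <= xmin <= xmax <= img_size - 1 and 0 <= ymin <= ymax <= img_size - 1):
        raise ValueError(f"degenerate or out-of-range box in {name!r}: {coords}")
    return name, xmin, ymin, xmax, ymax

print(parse_box_line("n06596364_0.JPEG\t0\t14\t63\t49"))
```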
recipes/vit/config/vit_te_base_patch16_224.yaml (1)
1-4: Defaults ordering LGTM. Hydra defaults look correct with `_self_` last to allow overrides from included configs.
recipes/vit/config/vit_base_patch16_224.yaml (2)
64-67: Confirm `resume_from_metric` accepts "-" in your trainer. If the trainer expects null/None or explicit "+"/"-", passing "-" as a string might silently no-op or misroute.
Would you like me to wire a small validation hook that asserts the field at startup?
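One possible shape for that hook, sketched in plain Python; the accepted `[+|-]<metric>` convention and the function name are assumptions, not the trainer's actual contract:

```python
# Hypothetical startup check for training.checkpoint.resume_from_metric.
# Assumption: the trainer accepts None, or a metric name optionally prefixed
# with "+" (maximize) / "-" (minimize); a bare "-" carries no metric name.

def validate_resume_from_metric(value):
    if value is None:
        return None
    if not isinstance(value, str) or value in ("+", "-"):
        raise ValueError(
            f"resume_from_metric must be None or '[+|-]<metric_name>', got {value!r}"
        )
    sign, name = (value[0], value[1:]) if value[0] in "+-" else ("+", value)
    return sign, name

print(validate_resume_from_metric("-val_loss"))  # -> ('-', 'val_loss')
```

Running this once at startup turns a silent misconfiguration into an immediate, readable error.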
51-54: Module path sanity check for FSDP unit modules. Ensure `vit.Block` and `vit.PatchEmbed` resolve to the actual implementation module; mismatches will crash at wrap time.
I can add a startup import check that logs resolved types before FSDP wrapping if helpful.
recipes/vit/config/defaults.yaml (3)
49-61: Good: FSDP strategy values use descriptive strings. Keeps configs readable and avoids magic numbers. Please keep other configs consistent with this format.
7-16: Defaults favor a plain ViT-B; sensible for a baseline. `qk_norm`/`scale_*` disabled here align with timm's ViT-B. Keep variant-specific tweaks in recipe configs.
43-48: Distributed defaults are sane for single-node demos.No action needed.
Actionable comments posted: 0
🧹 Nitpick comments (1)
recipes/vit/.ruff.toml (1)
2-3: Use `extend-ignore` to augment root ignores; avoid overriding them.

Setting `lint.ignore` here replaces the parent's ignores. If you meant to add to them, use `lint.extend-ignore`. Also consider setting a higher McCabe threshold instead of disabling `C901` entirely, and prefer targeted `per-file-ignores` for `D`/`N` to avoid muting useful signals across the whole recipe.

Apply:

```diff
 [lint]
-ignore = ["D", "N", "C901", "PLW2901"]
+extend-ignore = ["D", "N", "PLW2901"]
+
+# Prefer constraining complexity rather than disabling it entirely.
+[lint.mccabe]
+max-complexity = 20
```

Optionally, if only tests or scripts need relaxed rules:

```toml
[lint.per-file-ignores]
"**/tests/**/*.py" = ["D", "N"]
"**/scripts/*.py" = ["D", "N"]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- recipes/vit/.ruff.toml (1 hunks)
- recipes/vit/config/vit_te_base_patch16_224.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- recipes/vit/config/vit_te_base_patch16_224.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (1)
recipes/vit/.ruff.toml (1)
1-1: No update required: `extend = "../.ruff.toml"` correctly extends `recipes/.ruff.toml`; there is no root `.ruff.toml` to reference.

Likely an incorrect or invalid review comment.
Force-pushed from 2f9a215 to c5a719b (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
recipes/vit/config/defaults.yaml (1)
70-72: Fix unsafe default: 100000 classes will instantiate an oversized head by default. Set a sane default (e.g., ImageNet-1k), and override per recipe (e.g., 5 for super-tiny-imagenet-5).
Apply:
```diff
-dataset:
-  num_classes: 100000
+dataset:
+  num_classes: 1000  # override to 5 in vit_*_patch16_224.yaml for super-tiny-imagenet-5
```
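A quick back-of-envelope supports this: assuming a ViT-B embedding dim of 768 (an assumption, matching timm's ViT-B, not stated in the config excerpt), the linear head for 100000 classes alone holds roughly 77M parameters, about 100x the 1k-class head:

```python
# Back-of-envelope for the classifier head size (assumes ViT-B embed_dim=768).
embed_dim = 768

def head_params(num_classes: int) -> int:
    return embed_dim * num_classes + num_classes  # weight matrix + bias vector

print(head_params(100_000))  # 76,900,000 params
print(head_params(1_000))    # 769,000 params
```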
🧹 Nitpick comments (6)
recipes/vit/config/defaults.yaml (6)
78-79: Enable shuffling for training. Non-shuffled training hurts convergence, even for tiny demos.

```diff
-  batch_size: 1
-  shuffle: false
+  batch_size: 1
+  shuffle: true
```
81-81: De-duplicate config by referencing model image/patch sizes. Prevents drift when changing `img_size`/`patch_size`.

```diff
-  img_size: 224
+  img_size: ${model.vit.img_size}
@@
-  patch_size: 16
+  patch_size: ${model.vit.patch_size}
@@
-  img_size: 224
+  img_size: ${model.vit.img_size}
@@
-  patch_size: 16
+  patch_size: ${model.vit.patch_size}
```

Also applies to: 98-99, 108-108, 116-117
52-56: Avoid over-fragmenting FSDP units; and verify class paths. Wrapping every `torch.nn.Linear`/`LayerNorm` can create thousands of FSDP units and tank perf. Prefer wrapping larger blocks; also ensure the `vit.*` symbols match actual classes.

```diff
 fsdp_unit_modules:
   - vit.Block
   - vit.PatchEmbed
-  - torch.nn.LayerNorm
-  - torch.nn.Linear
```

Would you like me to check the module paths across the repo to confirm they resolve?
66-68: Use a non-null sentinel for `resume_from_metric`. If the loader treats null specially, a sentinel like "-" is safer.

```diff
 checkpoint:
   path: null
-  resume_from_metric: null
+  resume_from_metric: "-"
```
72-72: Make `num_workers` configurable via env with a sensible default. Improves I/O without hardcoding per machine.

```diff
-  num_workers: 0
+  num_workers: ${oc.env:NUM_WORKERS,4}
```

If Hydra env interpolation isn't available here, keep 0 and document overriding it per environment.
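If Hydra's `${oc.env:...}` interpolation can't be used, a plain-Python fallback at DataLoader construction could look like this (the `NUM_WORKERS` variable name and helper are assumptions, not part of the recipe):

```python
# Hypothetical fallback: read the worker count from an environment variable,
# defaulting when it is unset, instead of hardcoding num_workers per machine.
import os

def resolve_num_workers(default: int = 4) -> int:
    return int(os.environ.get("NUM_WORKERS", default))

os.environ["NUM_WORKERS"] = "8"
print(resolve_num_workers())  # -> 8
```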
85-96: YAML style consistency (booleans/floats). Standardize booleans to lowercase and floats to explicit decimals for readability.

```diff
-  vflip: 0.
+  vflip: 0.0
@@
-  color_jitter_prob: null
-  grayscale_prob: 0.
-  gaussian_blur_prob: 0.
+  color_jitter_prob: null
+  grayscale_prob: 0.0
+  gaussian_blur_prob: 0.0
@@
-  normalize: True
-  separate: False
+  normalize: true
+  separate: false
@@
-  patchify: False
+  patchify: false
@@
-  normalize: true
-  patch_size: 16
-  patchify: false
+  normalize: true
+  patch_size: ${model.vit.patch_size}
+  patchify: false
```

Also applies to: 115-117
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- recipes/vit/config/defaults.yaml (1 hunks)
- recipes/vit/config/vit_base_patch16_224.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- recipes/vit/config/vit_base_patch16_224.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (1)
recipes/vit/config/defaults.yaml (1)
6-7: Good: derive classifier size from dataset config. Binding `model.vit.num_classes` to `${dataset.num_classes}` avoids head/dataset drift.
Actionable comments posted: 0
🧹 Nitpick comments (10)
recipes/vit/README.md (10)
3-3: Fix TIMM casing and avoid bare URL (MD034). Use a proper Markdown link and correct "TImM" → "timm".

```diff
-_Adapted ViT model code from huggingface/pytorch-image-models (TImM), which you can check out here: https://github.com/huggingface/pytorch-image-models_
+_Adapted ViT model code from [timm (huggingface/pytorch-image-models)](https://github.com/huggingface/pytorch-image-models)._
```
5-7: Fix heading levels (MD001) and minor wording. H3 → H2 for section, H4 → H3 for subsection; "Pre-Requisites" → "Prerequisites".

```diff
-### Pre-Requisites
-
-#### Docker Container
+## Prerequisites
+
+### Docker container
```
22-24: Make heading level consistent (MD001) and casing.

```diff
-#### PIP Install
+### pip install
```

11-13: Add language to fenced code block (MD040).

````diff
-```
+```bash
 docker build -t <image_repo>:<image_tag> .
````

17-20: Add language to fenced code block (MD040).

````diff
-```
+```bash
 # Utilize plenty of shared memory (--shm-size) to support loading large batches of image data!
 docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
````

26-31: Add language to fenced code block (MD040) and tiny clarity note.

````diff
-```
+```bash
 cd recipes/vit
 # If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
 # We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
 pip install -r requirements.txt
````

37-39: Define NGPU, ensure checkpoint dir exists, and add language to code block.

Without NGPU set, the command fails. Also pre-create the output path.

````diff
-```
-torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
-```
+```bash
+# Set GPUs (override by exporting NGPU beforehand)
+export NGPU=${NGPU:-2}
+mkdir -p ./ckpts/vit
+torchrun --nproc-per-node "${NGPU}" train.py --config-name vit_base_patch16_224 distributed.dp_shard="${NGPU}" training.checkpoint.path=./ckpts/vit
+```
````
41-41: Possible broken PyTorch docs URL; verify and update. Likely correct domain is pytorch.org/docs, not docs.pytorch.org/docs.

```diff
-... save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints ...
+... save auto-resumable [Torch DCP](https://pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints ...
```
43-45: Verify relative links and align anchor text. Confirm these files exist under recipes/vit/ and fix the mismatched anchor text vs link.

```diff
-[`train.py`](train.py) is the transparent entrypoint to this script ...
+[`train.py`](./train.py) is the transparent entrypoint to this recipe ...
 ...
-The TIMM-derived model code for the ViT can be found in [`vit.py`](vit.py), and data utilities for ImageNet can be found in [`imagenet_*.py`](imagenet_dataset.py).
+The TIMM-derived model code for the ViT can be found in [`vit.py`](./vit.py), and ImageNet dataset utilities can be found in [`imagenet_dataset.py`](./imagenet_dataset.py).
```
43-43: Minor wording nit. "entrypoint to this script" reads oddly; use "entrypoint script" or "entrypoint to this recipe."

```diff
-[`train.py`](./train.py) is the transparent entrypoint to this script
+[`train.py`](./train.py) is the transparent entrypoint script
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (229)
(super-tiny-imagenet-5 JPEG files under recipes/vit/data/ excluded by `!**/*.jpeg`)
by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_28.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_29.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_3.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_30.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_31.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_32.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_33.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_34.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_35.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_36.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_37.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_38.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_39.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_4.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_40.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_41.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_42.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_43.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_44.JPEGis excluded 
by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_45.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_46.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_47.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_48.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_49.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_5.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_6.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_7.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_8.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n06596364/images/n06596364_9.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_0.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_1.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_10.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_11.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_12.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_13.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_14.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_15.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_16.JPEGis excluded 
by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_17.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_18.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_19.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_2.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_20.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_21.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_22.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_23.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_24.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_25.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_26.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_27.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_28.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_29.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_3.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_30.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_31.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_32.JPEGis excluded by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_33.JPEGis excluded 
by!**/*.jpegrecipes/vit/data/super-tiny-imagenet-5/train/n09193705/images/n09193705_34.JPEGis excluded by!**/*.jpeg
📒 Files selected for processing (11)
- `recipes/vit/.gitignore` (1 hunks)
- `recipes/vit/.ruff.toml` (1 hunks)
- `recipes/vit/Dockerfile` (1 hunks)
- `recipes/vit/README.md` (1 hunks)
- `recipes/vit/config/defaults.yaml` (1 hunks)
- `recipes/vit/config/vit_base_patch16_224.yaml` (1 hunks)
- `recipes/vit/config/vit_te_base_patch16_224.yaml` (1 hunks)
- `recipes/vit/data/super-tiny-imagenet-5/train/n03404251/n03404251_boxes.txt` (1 hunks)
- `recipes/vit/data/super-tiny-imagenet-5/train/n03976657/n03976657_boxes.txt` (1 hunks)
- `recipes/vit/data/super-tiny-imagenet-5/train/n04507155/n04507155_boxes.txt` (1 hunks)
- `recipes/vit/data/super-tiny-imagenet-5/train/n06596364/n06596364_boxes.txt` (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- recipes/vit/.gitignore
🚧 Files skipped from review as they are similar to previous changes (9)
- recipes/vit/data/super-tiny-imagenet-5/train/n06596364/n06596364_boxes.txt
- recipes/vit/data/super-tiny-imagenet-5/train/n03404251/n03404251_boxes.txt
- recipes/vit/.ruff.toml
- recipes/vit/Dockerfile
- recipes/vit/config/vit_te_base_patch16_224.yaml
- recipes/vit/data/super-tiny-imagenet-5/train/n03976657/n03976657_boxes.txt
- recipes/vit/config/vit_base_patch16_224.yaml
- recipes/vit/config/defaults.yaml
- recipes/vit/data/super-tiny-imagenet-5/train/n04507155/n04507155_boxes.txt
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: pre-commit
- GitHub Check: Analyze (rust)
- GitHub Check: Analyze (python)
4408fb5 to d97f7a2 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
recipes/vit/README.md (1)
3-3: Fix attribution formatting, link, and provenance note; also add a matching header in `vit.py`. Use a proper Markdown link (avoids MD034), correct "timm" casing, and point readers to the bundled LICENSE and a provenance header in `vit.py`. This aligns with the prior ask to include copyright/provenance in code.
```diff
-_Adapted ViT model code from huggingface/pytorch-image-models (TImM) written by Ross Wightman (@rwightman / Copyright 2020), which you can check out here: https://github.com/huggingface/pytorch-image-models_
+_Adapted ViT model code from [timm (pytorch-image-models)](https://github.com/huggingface/pytorch-image-models) by Ross Wightman (Copyright 2020). See `recipes/vit/LICENSE` and provenance notes in `vit.py`._
```
🧹 Nitpick comments (8)
recipes/vit/README.md (8)
5-5: Fix heading levels and wording (MD001)
Heading levels should increment by one. Also standardize wording/casing.

```diff
-### Pre-Requisites
+## Prerequisites
@@
-#### Docker Container
+### Docker container
@@
-#### PIP Install
+### pip install
```

Also applies to: 7-7, 22-22
11-11: Specify languages for fenced blocks (MD040)
Add `bash` language to all shell code fences.

````diff
-```
+```bash
@@
-```
+```bash
@@
-```
+```bash
@@
-```
+```bash
````

Also applies to: 17-17, 26-26, 37-37
19-19: Mount workspace into the container
Without a bind mount and workdir, users won’t see repo files inside the container.

```diff
-docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
+docker run -it --rm --gpus=all --shm-size=16G -v "$(pwd)":/workspace -w /workspace <image_repo>:<image_tag>
```
37-39: Show NGPU definition before torchrun
Make the example self-contained by defining NGPU.

````diff
-```bash
-torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
-```
+```bash
+export NGPU=2 # example
+torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
+```
````
41-41: Fix (likely) broken PyTorch DCP docs URL
“docs.pytorch.org/docs/...” looks wrong; PyTorch docs are usually at pytorch.org/docs/...

```diff
-which will train on a local tiny 5-class version of [ImageNet](https://image-net.org/) ([super-tiny-imagenet-5](./data/super-tiny-imagenet-5/)) and save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
+which will train on a local tiny 5-class version of [ImageNet](https://image-net.org/) ([super-tiny-imagenet-5](./data/super-tiny-imagenet-5/)) and save auto-resumable [Torch DCP](https://pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
```
43-43: Minor wording/typo fixes
“entrypoint” → “entry point”; “de-compressed” → “decompressed”.

```diff
-[`train.py`](train.py) is the transparent entrypoint to this script that explains how to modify your own training loop for `Megatron-FSDP` ([PyPI: `megatron-fsdp`](https://pypi.org/project/megatron-fsdp/) / [Source: Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core/distributed/fsdp/src)) to fully-shard your model across all devices. After executing `train.py` for the first time, the de-compressed ImageNet dataset will be available in `data/super-tiny-imagenet-5/...` (sourced from [`super-tiny-imagenet-5.tar.gz`](./data/super-tiny-imagenet-5.tar.gz)) for experimentation and review.
+[`train.py`](train.py) is the transparent entry point to this recipe that explains how to modify your own training loop for `Megatron-FSDP` ([PyPI: `megatron-fsdp`](https://pypi.org/project/megatron-fsdp/) / [Source: Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core/distributed/fsdp/src)) to fully shard your model across all devices. After executing `train.py` for the first time, the decompressed ImageNet dataset will be available in `data/super-tiny-imagenet-5/...` (sourced from [`super-tiny-imagenet-5.tar.gz`](./data/super-tiny-imagenet-5.tar.gz)) for experimentation and review.
```
45-45: Normalize “timm” casing
Use “timm” (project’s canonical name) for consistency.

```diff
-The TIMM-derived model code for the ViT can be found in [`vit.py`](vit.py), and data utilities for ImageNet can be found in [`imagenet_*.py`](imagenet_dataset.py).
+The timm-derived model code for the ViT can be found in [`vit.py`](vit.py), and data utilities for ImageNet can be found in [`imagenet_*.py`](imagenet_dataset.py).
```
38-39: Optional: add TE config example next to the base config
Make it obvious how to enable TransformerEngine. Example block to append after the base command:
```bash
# TransformerEngine variant
export NGPU=2
torchrun --nproc-per-node ${NGPU} train.py --config-name vit_te_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit_te
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (1)
`recipes/vit/data/super-tiny-imagenet-5.tar.gz` is excluded by `!**/*.gz`
📒 Files selected for processing (3)
- `recipes/vit/.gitignore` (1 hunks)
- `recipes/vit/LICENSE` (1 hunks)
- `recipes/vit/README.md` (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- recipes/vit/LICENSE
🚧 Files skipped from review as they are similar to previous changes (1)
- recipes/vit/.gitignore
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests (recipes/vit, nvcr.io/nvidia/pytorch:25.06-py3)
- GitHub Check: Analyze (rust)
68b2b7a to ba27aa8 (Compare)
Actionable comments posted: 6
♻️ Duplicate comments (1)
recipes/vit/README.md (1)
22-31: Replace incorrect PIP_CONSTRAINT note; recommend venv and fence language. This mirrors a prior comment; use a virtual environment and remove the misleading PIP_CONSTRAINT guidance.
````diff
-#### PIP Install
+### Python environment (pip)
@@
-If you have a virtual environment and CUDA installed, you can install the recipe's dependencies using `pip`:
+If you have CUDA installed, install the recipe’s dependencies in an isolated virtual environment:
@@
-```
-cd recipes/vit
-# If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
-# We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
-pip install -r requirements.txt
-```
+```bash
+cd recipes/vit
+python -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+```
````
🧹 Nitpick comments (11)
recipes/vit/imagenet_dataset.py (1)
304-329: Consider adding a maximum retry limit configuration. The current implementation retries up to 50 consecutive errors, which is hardcoded. Consider making this configurable for different use cases.
You could add a parameter to the constructor:
```diff
 def __init__(
     self,
     root,
     reader=None,
     class_map: Optional[str | dict] = None,
     label_map: Optional[str | dict] = None,
     load_bytes=False,
     input_img_mode="RGB",
     transform=None,
     target_transform=None,
     class_filter: Optional[List[Any]] = None,
+    max_consecutive_errors: int = 50,
     **kwargs,
 ):
```

Then use it in `__getitem__`:

```diff
-        if self._consecutive_errors < 50:
+        if self._consecutive_errors < self.max_consecutive_errors:
```

recipes/vit/vit.py (1)
1357-1402: Add sequence‐length assertion before TransformerEngine blocks
In the model’s forward (e.g. in `VisionTransformer.forward`), assert that `x.shape[1] == embed_len` (where `embed_len = num_patches + num_prefix_tokens` when `no_embed_class=False`, else `num_patches`) before calling `self.blocks`. This will catch any mismatch between the input sequence length and the `seq_length` passed to the TransformerEngine.

recipes/vit/train.py (2)
197-201: Document the sub-mesh flattening behavior. The warning about sub-meshes being deleted by Torch is important. Consider adding more detailed documentation about when and why these references might be deleted.
Add more explicit documentation:
```diff
     # Sub-meshes (possibly) required for Megatron-FSDP.
-    # WARNING: These have a tendency to be deleted by Torch. Save references
-    # or pass them to all classes or functions that use them.
+    # WARNING: PyTorch's DeviceMesh sub-meshes can be garbage collected if no strong
+    # references are kept. These flattened meshes MUST be passed directly to functions
+    # that use them or stored as attributes to prevent deletion. The parent device_mesh
+    # alone does not keep these sub-meshes alive.
```
137-145: Consider using `Path.chmod` with explicit permissions. While the current implementation works, using Path's `chmod` method would be more Pythonic and avoid mixing `os` and `Path` operations.
```diff
-    # Relax checkpoint permissions, which may be helpful when saving checkpoints in a container owned by root.
-    mode = 0o777
-    for dirpath, _, filenames in os.walk(ckpt_dir):
-        # Change current directory perms.
-        os.chmod(dirpath, mode)
-        for filename in filenames:
-            # Change file perms.
-            file_path = Path(dirpath) / filename
-            os.chmod(file_path, mode)
+    # Relax checkpoint permissions, which may be helpful when saving checkpoints in a container owned by root.
+    mode = 0o777
+    for path in ckpt_dir.rglob("*"):
+        path.chmod(mode)
+    ckpt_dir.chmod(mode)
```
3-3: Fix TIMM attribution formatting and avoid bare URL (MD034). Use a proper Markdown link and correct capitalization.
```diff
-_Adapted ViT model code from huggingface/pytorch-image-models (TImM) written by Ross Wightman (@rwightman / Copyright 2020), which you can check out here: https://github.com/huggingface/pytorch-image-models_
+_Adapted ViT model code from **timm** (PyTorch Image Models) by Ross Wightman ([@rwightman]) © 2020 — see [huggingface/pytorch-image-models](https://github.com/huggingface/pytorch-image-models)._
```
5-8: Fix heading levels and spelling (MD001). H2 should follow H1; child sections should increment by one; also “Prerequisites” spelling.
```diff
-### Pre-Requisites
+## Prerequisites
-
-#### Docker Container
+### Docker container
```
11-13: Add language to code fence (MD040).

````diff
-```
+```bash
 docker build -t <image_repo>:<image_tag> .
````

17-20: Add language to code fence (MD040).

````diff
-```
+```bash
 # Utilize plenty of shared memory (--shm-size) to support loading large batches of image data!
 docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
````

35-39: Specify code fence language and show NGPU example to reduce confusion. Add bash fence and an explicit NGPU example (matches your 2‑GPU guidance). Confirm whether you want to highlight the TE config by default (vit_te_base_patch16_224) or keep the base variant here; both exist per configs.

````diff
-```
-torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
-```
+```bash
+# Example: 2 GPUs
+NGPU=2
+torchrun --nproc-per-node ${NGPU} train.py \
+    --config-name vit_base_patch16_224 \
+    distributed.dp_shard=${NGPU} \
+    training.checkpoint.path=./ckpts/vit
+```
````
41-43: Fix PyTorch docs URL and minor wording. Correct docs host, and “decompressed” spelling.
```diff
-which will train on a local tiny 5-class version of [ImageNet](https://image-net.org/) ([super-tiny-imagenet-5](./data/super-tiny-imagenet-5/)) and save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
+which will train on a local tiny 5‑class version of [ImageNet](https://image-net.org/) ([super-tiny-imagenet-5](./data/super-tiny-imagenet-5/)) and save auto‑resumable [Torch Distributed Checkpoint](https://pytorch.org/docs/stable/distributed.checkpoint.html) artifacts to the `training.checkpoint.path` directory.
@@
-[...] After executing `train.py` for the first time, the de-compressed ImageNet dataset will be available in `data/super-tiny-imagenet-5/...` [...]
+[...] After executing `train.py` for the first time, the decompressed ImageNet dataset will be available in `data/super-tiny-imagenet-5/...` [...]
```
45-46: Fix link text/target mismatch; link both dataset utilities explicitly.-The TIMM-derived model code for the ViT can be found in [`vit.py`](vit.py), and data utilities for ImageNet can be found in [`imagenet_*.py`](imagenet_dataset.py). +The TIMM‑derived model code for the ViT is in [`vit.py`](vit.py), and ImageNet data utilities are in [`imagenet_dataset.py`](imagenet_dataset.py) and [`imagenet_utils.py`](imagenet_utils.py).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- `recipes/vit/LICENSE` (1 hunks)
- `recipes/vit/README.md` (1 hunks)
- `recipes/vit/imagenet_dataset.py` (1 hunks)
- `recipes/vit/imagenet_utils.py` (1 hunks)
- `recipes/vit/train.py` (1 hunks)
- `recipes/vit/vit.py` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- recipes/vit/imagenet_utils.py
- recipes/vit/LICENSE
🧰 Additional context used
🧬 Code graph analysis (1)
recipes/vit/train.py (3)
recipes/vit/imagenet_dataset.py (4)
- `ImageNetDataset` (250-337)
- `infinite_dataloader` (38-45)
- `filename` (243-244)
- `filename` (333-334)

recipes/vit/imagenet_utils.py (2)
- `transforms_imagenet_eval` (946-1033)
- `transforms_imagenet_train` (775-943)

recipes/vit/vit.py (1)
- `VisionTransformer` (1218-1657)
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: changed-files
- GitHub Check: Analyze (rust)
🔇 Additional comments (5)
recipes/vit/imagenet_dataset.py (1)
38-45: LGTM! Clean infinite dataloader implementation. The implementation correctly handles epoch updates for proper shuffling in distributed training scenarios.
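For readers new to the pattern, here is a minimal sketch of an epoch-aware infinite loader. It assumes a sampler object exposing `set_epoch` (as `torch.utils.data.DistributedSampler` does); the function name mirrors the reviewed `infinite_dataloader` but the body is an illustration, not the recipe's exact code.

```python
def infinite_dataloader(dataloader, sampler):
    """Yield batches forever, bumping the sampler epoch each full pass.

    Calling set_epoch(epoch) reseeds the distributed shuffle so that
    every pass over the dataset produces a different batch order.
    """
    epoch = 0
    while True:
        if hasattr(sampler, "set_epoch"):
            sampler.set_epoch(epoch)  # reseed the shuffle for this pass
        for batch in dataloader:
            yield batch
        epoch += 1
```

Without the `set_epoch` call, a `DistributedSampler` with `shuffle=True` would replay the same shuffled order every epoch.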
recipes/vit/vit.py (2)
1454-1456: LGTM! Proper meta device initialization support. The `reset_parameters` method correctly delegates to `init_weights()` for Megatron-FSDP meta device initialization compatibility.
95-96: No changes needed for `LayerScale.reset_parameters`: it correctly resets `gamma` (the only parameter) for Megatron-FSDP meta-device initialization.
333-336: LGTM! Important persistent workers configuration. The comment correctly highlights the critical requirement for `persistent_workers=True` when using Megatron-FSDP and Torch DCP to prevent worker conflicts.
47-47: Config path reference looks good.
Actionable comments posted: 0
♻️ Duplicate comments (3)
recipes/vit/train.py (3)
58-100: Checkpoint name parsing is still vulnerable to malformed names. The code assumes checkpoints follow the exact format `step_<step_idx>_loss_<loss_value>` and directly accesses split indices without validation, which can cause `IndexError` for malformed checkpoint names. The previous suggestion to add validation for checkpoint name parsing should still be implemented to prevent crashes from malformed checkpoint directories.
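One hedged way to harden the parsing is sketched below. The helper names and the regex are illustrative (they assume the `step_<step_idx>_loss_<loss_value>` naming convention described above), not the recipe's actual code.

```python
import re
from pathlib import Path
from typing import Optional

# Checkpoint dirs are assumed to be named "step_<step_idx>_loss_<loss_value>".
_CKPT_NAME_RE = re.compile(r"^step_(\d+)_loss_([0-9.]+)$")


def parse_checkpoint_step(ckpt_dir: Path) -> Optional[int]:
    """Return the step index encoded in a checkpoint dir name, or None if malformed."""
    match = _CKPT_NAME_RE.match(ckpt_dir.name)
    if match is None:
        return None  # Skip malformed names instead of raising IndexError.
    return int(match.group(1))


def latest_checkpoint(ckpt_root: Path) -> Optional[Path]:
    """Return the checkpoint dir with the highest step index, ignoring malformed names."""
    valid = [
        (step, child)
        for child in ckpt_root.iterdir()
        if child.is_dir() and (step := parse_checkpoint_step(child)) is not None
    ]
    if not valid:
        return None
    return max(valid, key=lambda pair: pair[0])[1]
```

The key design choice is that malformed directory names are silently skipped rather than crashing auto-resume; a warning log could be added where skipping should be surfaced.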
304-308: Tar extraction lacks error handling. The extraction could fail due to permissions, disk space, or a corrupted archive, and currently would surface only as an opaque traceback.
The previous suggestion to add error handling for tar extraction should still be implemented to prevent silent failures or unclear error messages.
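A minimal sketch of what guarded extraction could look like (the function and path names are illustrative, not the recipe's actual code):

```python
import tarfile
from pathlib import Path


def extract_dataset(archive: Path, dest: Path) -> None:
    """Extract a .tar.gz dataset archive, surfacing a clear error on failure."""
    try:
        with tarfile.open(archive, "r:gz") as tar:
            # On Python 3.12+, consider tar.extractall(dest, filter="data")
            # to reject unsafe archive members.
            tar.extractall(dest)
    except (tarfile.TarError, OSError) as exc:
        raise RuntimeError(f"Failed to extract {archive} to {dest}: {exc}") from exc
```

Wrapping both `tarfile.TarError` and `OSError` covers corrupted archives as well as permission and disk-space failures, and the re-raised `RuntimeError` names the archive and destination for easier debugging.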
387-391: Dataloader resume logic may cause data duplication/skipping. Using `start=latest_step_idx` in `enumerate()` only shifts the batch index counter; it doesn't advance the underlying iterator. This means the dataloader will restart from the beginning of the dataset rather than resuming from the correct position. The dataloader should properly skip the first `latest_step_idx` batches from the infinite iterator to ensure correct resumption without data duplication or skipping.
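The difference is easy to demonstrate with plain iterators: `enumerate(start=...)` merely relabels the counter, while consuming the iterator (e.g. via `itertools.islice`) actually discards the already-trained batches.

```python
from itertools import islice


def infinite_batches():
    """Stand-in for an infinite dataloader yielding batch indices 0, 1, 2, ..."""
    i = 0
    while True:
        yield i
        i += 1


latest_step_idx = 3

# enumerate(start=...) only relabels the counter; the underlying iterator
# still starts at batch 0, so the first batches would be replayed.
wrong_step, wrong_batch = next(enumerate(infinite_batches(), start=latest_step_idx))

# To resume correctly, consume (discard) the already-trained batches first.
loader = infinite_batches()
for _ in islice(loader, latest_step_idx):
    pass  # skip batches 0 .. latest_step_idx - 1
right_step, right_batch = next(enumerate(loader, start=latest_step_idx))
```

Here `wrong_batch` is 0 (replayed data) while `right_batch` is 3, matching the resumed step counter.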
🧹 Nitpick comments (1)
recipes/vit/train.py (1)
116-148: Consider security implications of changing file permissions. Setting all files and directories to 0o777 (world-writable) could create security vulnerabilities in multi-user environments.
Consider using more restrictive permissions or making this configurable:
```diff
-    mode = 0o777
+    # Use more secure permissions - owner and group can write, others can read
+    dir_mode = 0o775
+    file_mode = 0o664
     for dirpath, _, filenames in os.walk(ckpt_dir):
         # Change current directory perms.
-        os.chmod(dirpath, mode)
+        os.chmod(dirpath, dir_mode)
         for filename in filenames:
             # Change file perms.
             file_path = Path(dirpath) / filename
-            os.chmod(file_path, mode)
+            os.chmod(file_path, file_mode)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- `recipes/vit/config/defaults.yaml` (1 hunks)
- `recipes/vit/config/vit_base_patch16_224.yaml` (1 hunks)
- `recipes/vit/train.py` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- recipes/vit/config/vit_base_patch16_224.yaml
- recipes/vit/config/defaults.yaml
🧰 Additional context used
🧬 Code graph analysis (1)
recipes/vit/train.py (3)
recipes/vit/imagenet_dataset.py (6)
- `ImageNetDataset` (250-337)
- `infinite_dataloader` (38-45)
- `filenames` (246-247)
- `filenames` (336-337)
- `filename` (243-244)
- `filename` (333-334)

recipes/vit/imagenet_utils.py (2)
- `transforms_imagenet_eval` (946-1033)
- `transforms_imagenet_train` (775-943)

recipes/vit/vit.py (1)
- `VisionTransformer` (1218-1657)
🔇 Additional comments (12)
recipes/vit/train.py (12)
1-38: LGTM! Clean imports and structure. The imports are well-organized and the file structure follows good practices with proper licensing and clear separation of concerns.
43-56: LGTM! Well-documented checkpoint loading function. The function properly handles distributed checkpoint loading with clear documentation and appropriate state dict management.
103-114: LGTM! Clean checkpoint saving implementation. The function properly handles distributed checkpoint saving with an appropriate state dict structure.
150-201: LGTM! Comprehensive device mesh setup. The device mesh setup is well-documented and properly handles the distributed training configuration with appropriate validation and sub-mesh creation.
204-252: LGTM! Clean model initialization with TransformerEngine support. The model initialization properly handles both standard and TransformerEngine configurations with appropriate conditional logic and meta device support.
254-296: LGTM! Comprehensive FSDP configuration. The Megatron-FSDP setup is well-documented with clear explanations of each parameter and appropriate device mesh integration.
309-364: LGTM! Proper dataset setup with distributed sampling. The dataset initialization properly handles both training and validation datasets with appropriate distributed sampling configuration and persistent workers for FSDP compatibility.
369-374: LGTM! Standard training utilities setup. The loss function and LR scheduler initialization are appropriate for the training configuration.
392-418: LGTM! Standard training step implementation.The training step properly handles data movement, forward/backward passes, gradient clipping, and optimizer updates with appropriate timing measurements.
419-450: LGTM! Proper validation loop with distributed reduction.The validation implementation correctly handles distributed loss reduction and checkpoint saving with appropriate model state management.
451-494: LGTM! Comprehensive logging and monitoring.The logging implementation provides detailed metrics for both stdout and WandB with appropriate timing measurements and memory monitoring.
495-514: LGTM! Proper cleanup and profiling output.The cleanup code properly handles memory profiler dumping, WandB finishing, and process group destruction.
Actionable comments posted: 17
♻️ Duplicate comments (4)
recipes/vit/train.py (3)
113-115: Checkpoint name parsing is brittle in load_auto_resume_checkpoint. This is implemented in recipes/vit/checkpoint.py; the calling site inherits the risk.
Ensure malformed dirs are skipped (parse+validate parts, handle ValueError/IndexError) as previously suggested in review. Would you like a patch rebased onto current checkpoint.py?
120-123: Add error handling when extracting the dataset tar. Apply:
```diff
-    if not Path(cfg.dataset.train.root).exists():
-        with tarfile.open(Path(cfg.dataset.train.root).parent.with_suffix(".tar.gz"), "r:gz") as tar:
-            tar.extractall(Path(cfg.dataset.train.root).parent.parent)
+    if not Path(cfg.dataset.train.root).exists():
+        tar_path = Path(cfg.dataset.train.root).parent.with_suffix(".tar.gz")
+        if not tar_path.exists():
+            raise FileNotFoundError(f"Dataset archive not found: {tar_path}")
+        try:
+            with tarfile.open(tar_path, "r:gz") as tar:
+                tar.extractall(Path(cfg.dataset.train.root).parent.parent)
+        except (tarfile.TarError, OSError) as e:
+            raise RuntimeError(f"Failed to extract dataset: {e}")
```
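A standalone sketch of the guarded extraction pattern (helper name and paths are illustrative, not from the recipe):

```python
import tarfile
import tempfile
from pathlib import Path

def extract_dataset(tar_path: Path, dest: Path) -> None:
    """Extract a .tar.gz archive with explicit, actionable errors."""
    if not tar_path.exists():
        raise FileNotFoundError(f"Dataset archive not found: {tar_path}")
    try:
        with tarfile.open(tar_path, "r:gz") as tar:
            tar.extractall(dest)
    except (tarfile.TarError, OSError) as e:
        raise RuntimeError(f"Failed to extract dataset: {e}") from e

# Demo with a throwaway archive.
work = Path(tempfile.mkdtemp())
(work / "train").mkdir()
(work / "train" / "img.txt").write_text("pixel data placeholder")
archive = work / "train.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(work / "train", arcname="train")

out = work / "extracted"
extract_dataset(archive, out)
assert (out / "train" / "img.txt").read_text() == "pixel data placeholder"
```

A missing archive now fails with `FileNotFoundError` up front instead of an opaque `tarfile.open` traceback deep in training startup.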
199-208: Fix resume semantics: enumerate(start=...) doesn’t skip data. This only offsets the counter; it does not advance the iterator. Use islice to drop items so training truly resumes.
Apply:
```diff
+    from itertools import islice
@@
-    for batch_idx, (input, target) in enumerate(
-        # Skip to latest step.
-        infinite_dataloader(train_dataloader, train_sampler),
-        start=latest_step_idx,
-    ):
+    train_iter = islice(infinite_dataloader(train_dataloader, train_sampler), latest_step_idx, None)
+    for batch_idx, (input, target) in zip(range(latest_step_idx, cfg.training.steps), train_iter):
```

And remove the manual break below. See next hunk.
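The difference is easy to demonstrate standalone (a toy integer stream stands in for the dataloader):

```python
from itertools import islice

# enumerate(start=3) only relabels the counter: the first item yielded is still 0.
first_with_enumerate = next(enumerate(iter(range(100)), start=3))

# islice actually drops the first 3 items before yielding anything.
first_with_islice = next(enumerate(islice(iter(range(100)), 3, None), start=3))

assert first_with_enumerate == (3, 0)  # counter offset, data NOT skipped
assert first_with_islice == (3, 3)     # counter and data aligned
```

With the `enumerate(start=...)` form, a resumed run replays batches that were already trained on; `islice` advances the underlying iterator so step N really sees batch N.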
recipes/vit/README.md (1)
22-31: Prefer venv; remove incorrect PIP_CONSTRAINT guidance. Apply:
````diff
-#### PIP Install
+### PIP Install
@@
-```
-cd recipes/vit
-# If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
-# We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
-pip install -r requirements.txt
-```
+```bash
+cd recipes/vit
+# Recommended: create and activate a clean virtual environment (or use the provided Docker image)
+python -m venv .venv && source .venv/bin/activate
+pip install -r requirements.txt
+```
````
🧹 Nitpick comments (21)
recipes/vit/checkpoint.py (4)
47-51: Typo and clarity in warning message. Typo in the quoted prefix and clearer guidance.
Apply:
```diff
-        _logger.warning(
-            f"Checkpoint state dictionary keys ({first_key}) may be prefixed "
-            "with 'modele.' if converted from a Megatron-FSDP DCP checkpoint."
-            "Set megatron_fsdp=True to automatically strip the prefix."
-        )
+        _logger.warning(
+            f"State dict keys (e.g., {first_key}) appear to be prefixed with 'module.' "
+            "from a Megatron-FSDP checkpoint. Set megatron_fsdp=True to strip it."
+        )
```
52-55: Consider safer torch.load usage. Default to `weights_only=True` when possible to reduce the pickle attack surface; fall back when the checkpoint carries optimizer or other non-tensor payloads.

Example:

```diff
-    # Must load with weights_only=False if you have an optimizer state in your checkpoint.
-    ckpt = torch.load(checkpoint_path, weights_only=False)
+    # Prefer weights_only=True; fall back if extra payloads are present.
+    try:
+        ckpt = torch.load(checkpoint_path, weights_only=True)
+    except (TypeError, pickle.UnpicklingError):
+        ckpt = torch.load(checkpoint_path, weights_only=False)
```

Note: `weights_only=True` rejects disallowed objects with `pickle.UnpicklingError` (and older torch without the kwarg raises `TypeError`), so catch both; this requires `import pickle`.
91-109: Auto-resume: directory parsing robustness. Assumes every subdir matches `step_<i>_loss_<val>`. Guard against malformed names and empty parts to avoid ValueError.

Apply:

```diff
-    subdirs = [x.absolute() for x in Path(cfg.training.checkpoint.path).iterdir() if x.is_dir()]
+    subdirs = [x.absolute() for x in Path(cfg.training.checkpoint.path).iterdir() if x.is_dir()]
+    # Keep only dirs matching the expected pattern.
+    def _parse(dirpath: Path):
+        parts = dirpath.name.split("_")
+        if len(parts) == 4 and parts[0] == "step" and parts[2] == "loss":
+            try:
+                return int(parts[1]), float(parts[3])
+            except ValueError:
+                return None
+        return None
+    parsed = [(d, _parse(d)) for d in subdirs]
+    subdirs = [d for d, p in parsed if p is not None]
     if len(subdirs) > 0:
         opt_metric_coeff = 1 if cfg.training.checkpoint.resume_from_metric == "+" else -1
         latest_subdir = max(
             subdirs,
             key=lambda x: (
                 opt_metric_coeff * float(x.name.split("_")[3])
                 if cfg.training.checkpoint.resume_from_metric
                 else 0,
                 x.stat().st_mtime,
             ),
         )
```
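The parsing helper can be exercised standalone against the recipe's `step_<i>_loss_<val>` naming convention (the helper below is a toy version for illustration):

```python
def parse_ckpt_dirname(name: str):
    """Return (step, loss) for names like 'step_75_loss_1.725', else None."""
    parts = name.split("_")
    if len(parts) == 4 and parts[0] == "step" and parts[2] == "loss":
        try:
            return int(parts[1]), float(parts[3])
        except ValueError:
            return None
    return None

assert parse_ckpt_dirname("step_75_loss_1.725") == (75, 1.725)
assert parse_ckpt_dirname("step_abc_loss_1.0") is None  # non-numeric step
assert parse_ckpt_dirname("latest") is None             # wrong shape entirely
assert parse_ckpt_dirname("step_1_acc_0.9") is None     # wrong metric tag
```

Stray directories dropped into the checkpoint path (e.g. `wandb`, `latest`, partially deleted runs) then simply fall out of the candidate list instead of crashing auto-resume.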
118-130: Use stateful objects for DCP save. You can pass `model`/`optimizer` directly; avoids intermediate dicts and preserves future extensibility.

Apply:

```diff
-    state_dict = {}
-    if model is not None:
-        state_dict["model"] = model.state_dict()
-    if optimizer is not None:
-        state_dict["optimizer"] = optimizer.state_dict()
-    torch.distributed.checkpoint.save(state_dict, checkpoint_id=checkpoint_path)
+    state = {}
+    if model is not None:
+        state["model"] = model
+    if optimizer is not None:
+        state["optimizer"] = optimizer
+    torch.distributed.checkpoint.save(state, checkpoint_id=checkpoint_path)
```

recipes/vit/infer.py (2)
31-33: Docstring mismatch. Doc says “Non-distributed inference” but code initializes a process group and device mesh. Update wording.
Apply:
- """ - Inference script for ViT. Non-distributed inference. - """ + """Inference script for ViT (single-rank supported, initializes a distributed context)."""
36-49: Optional: CPU fallback for broader usability. If you want tests/usage to run on CPU-only, gate CUDA calls or add a fallback.
Example:
device = "cuda" if torch.cuda.is_available() else "cpu" model = build_vit_model(cfg, device_mesh).to(device) input = torch.randn(1, 3, 224, 224, device=device)recipes/vit/distributed.py (1)
76-77: Destroy process group only if initialized. Minor safety.
Apply:
```diff
-    torch.distributed.destroy_process_group()
+    if torch.distributed.is_initialized():
+        torch.distributed.destroy_process_group()
```

recipes/vit/train.py (3)
308-310: Remove manual break; loop now bounded by zip(range(...)). Apply:
```diff
-        # Terminate if completed training steps.
-        if batch_idx >= cfg.training.steps:
-            break
+        # loop bounded by range(cfg.training.steps) via zip()
```
220-221: Use non_blocking GPU transfers (paired with pin_memory). Small speed win; consistent with pin_memory above. Apply:
Apply:
```diff
-        input = input.cuda()
-        target = target.cuda()
+        input = input.cuda(non_blocking=True)
+        target = target.cuda(non_blocking=True)
```
299-306: Update progress bar every step (not only on log intervals). Current update inside the log block undercounts progress. Apply:
Apply:
```diff
-            progress_bar.update(1)
+            # logging only
@@
-            # Reset timer.
+            # Reset timer and advance progress.
             t_start = time.perf_counter()
+            if torch.distributed.get_rank() == 0:
+                progress_bar.update(1)
```

recipes/vit/README.md (6)
3-3: Fix TIMM attribution text and avoid bare URL. Apply:
```diff
-_Adapted ViT model code from huggingface/pytorch-image-models (TImM) written by Ross Wightman (@rwightman / Copyright 2020), which you can check out here: https://github.com/huggingface/pytorch-image-models_
+_Adapted ViT model code from [huggingface/pytorch-image-models (timm)](https://github.com/huggingface/pytorch-image-models) by Ross Wightman (@rwightman, Copyright 2020)._
```
5-22: Normalize heading levels and add languages to fenced blocks. Addresses MD001/MD040.
Apply:
````diff
-### Pre-Requisites
+## Pre-Requisites
@@
-#### Docker Container
+### Docker Container
@@
-```
+```bash
 docker build -t <image_repo>:<image_tag> .
@@
 Utilize plenty of shared memory (--shm-size) to support loading large batches of image data!
-```
+```bash
 docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
````
33-41: Heading level + fenced language + DCP link style. Apply:
````diff
-### Training a Vision Transformer
+## Training a Vision Transformer
@@
-```
+```bash
 torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
@@
-which will train on a local tiny 5-class version of ImageNet (super-tiny-imagenet-5) and save auto-resumable Torch DCP checkpoints to the `training.checkpoint.path` directory.
+which will train on a local tiny 5-class version of ImageNet (super-tiny-imagenet-5) and save auto-resumable Torch DCP checkpoints to the `training.checkpoint.path` directory.
````

---

49-66: Add languages to fenced blocks

Apply:

````diff
-#### Checkpoint Conversion
+### Checkpoint Conversion
@@
-```
+```bash
 python -m torch.distributed.checkpoint.format_utils --help
 usage: format_utils.py [-h] {torch_to_dcp,dcp_to_torch} src dst
@@
-```
+```
````
67-76: Add languages to fenced blocks. Apply:
````diff
-```
+```bash
 python -m torch.distributed.checkpoint.format_utils dcp_to_torch step_75_loss_1.725 torch_ckpt_test.pt
@@
-```
+```python
 from torch.distributed.checkpoint.format_utils import dcp_to_torch_save, torch_save_to_dcp
@@
 torch_save_to_dcp(TORCH_SAVE_CHECKPOINT_PATH, f"{CHECKPOINT_DIR}_new")
````
108-108: Avoid bare URL; use a link. Apply:
```diff
-TODO(@cspades): For converting DCP directly to HuggingFace SafeTensors checkpoints, you can look into: https://pytorch.org/blog/huggingface-safetensors-support-in-pytorch-distributed-checkpointing/
+TODO(@cspades): For converting DCP directly to HuggingFace SafeTensors checkpoints, see: [PyTorch blog: HuggingFace safetensors support in DCP](https://pytorch.org/blog/huggingface-safetensors-support-in-pytorch-distributed-checkpointing/).
```

recipes/vit/test_train.py (2)
53-55: Redundant resume override.

`vit_resume_config.training.steps = 10` repeats the already-set value from the compose overrides. Remove to reduce noise.

```diff
-    vit_resume_config = deepcopy(vit_config)
-    vit_resume_config.training.steps = 10
+    vit_resume_config = deepcopy(vit_config)
```
58-61: Make the checkpoint assertion robust. Depending on the saver (best/latest/step), more than one directory can exist after the first run. Assert “at least one”.
```diff
-    assert sum(1 for item in (Path(tmp_path) / "ckpt").iterdir() if item.is_dir()) == 1, (
-        "Expected 1 checkpoint with 10 training steps and validation interval of 10."
-    )
+    assert sum(1 for item in (Path(tmp_path) / "ckpt").iterdir() if item.is_dir()) >= 1, (
+        "Expected at least 1 checkpoint with 10 training steps and validation interval of 10."
+    )
```
1466-1473: Avoid fragile access to TransformerEngine internals.

Attribute names in TE are not stable. Gate rescaling by attribute presence to prevent AttributeError when `init_variance_rescale=True`.

```diff
-        for layer_id, layer in enumerate(self.blocks):
-            if self.block_fn == TransformerLayer:
-                rescale(layer.self_attention.proj.weight.data, layer_id + 1)
-                rescale(layer.layernorm_mlp.fc2_weight.data, layer_id + 1)
+        for layer_id, layer in enumerate(self.blocks):
+            if self.block_fn == TransformerLayer:
+                if hasattr(layer, "self_attention") and hasattr(layer.self_attention, "proj") and hasattr(
+                    layer.self_attention.proj, "weight"
+                ):
+                    rescale(layer.self_attention.proj.weight.data, layer_id + 1)
+                if hasattr(layer, "layernorm_mlp") and hasattr(layer.layernorm_mlp, "fc2_weight"):
+                    rescale(layer.layernorm_mlp.fc2_weight.data, layer_id + 1)
             else:
                 rescale(layer.attn.proj.weight.data, layer_id + 1)
                 rescale(layer.mlp.fc2.weight.data, layer_id + 1)
```
700-703: Respect tuple img_size. Current code overwrites tuples; support both int and (H, W).
```diff
-        img_size = (img_size, img_size)
+        if isinstance(img_size, int):
+            img_size = (img_size, img_size)
+        else:
+            assert len(img_size) == 2, "img_size tuple must be (H, W)"
```
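A standalone sketch of the normalization (the helper name is illustrative; in the model it would be inlined where `img_size` is consumed):

```python
def normalize_img_size(img_size):
    """Accept an int (square) or an (H, W) pair; always return a 2-tuple."""
    if isinstance(img_size, int):
        return (img_size, img_size)
    size = tuple(img_size)
    if len(size) != 2:
        raise ValueError(f"img_size must be int or (H, W), got {img_size!r}")
    return size

assert normalize_img_size(224) == (224, 224)
assert normalize_img_size((224, 196)) == (224, 196)
assert normalize_img_size([320, 240]) == (320, 240)
```

This keeps non-square inputs (common for fine-tuning at different aspect ratios) usable instead of silently squaring them.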
1062-1064: Raise explicit error on unknown pool type.

`assert not pool_type` can be optimized out; raise instead.

```diff
-    else:
-        assert not pool_type, f"Unknown pool type {pool_type}"
+    else:
+        raise AssertionError(f"Unknown pool type {pool_type}")
```
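For context, `assert` statements are stripped when Python runs with `-O`, so an unknown pool type would silently fall through; an explicit raise always fires. A toy dispatch illustrating the pattern (function and values are made up):

```python
def pool_features(pool_type: str, values):
    """Dispatch pooling; raise on unknown pool types instead of asserting."""
    if pool_type == "avg":
        return sum(values) / len(values)
    if pool_type == "max":
        return max(values)
    # An `assert not pool_type` here would vanish under `python -O`;
    # an explicit raise survives optimization flags.
    raise AssertionError(f"Unknown pool type {pool_type}")

assert pool_features("avg", [1.0, 3.0]) == 2.0
assert pool_features("max", [1.0, 3.0]) == 3.0
```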
📜 Review details
Configuration used: CodeRabbit UI
📒 Files selected for processing (12)
- recipes/vit/README.md (1 hunks)
- recipes/vit/checkpoint.py (1 hunks)
- recipes/vit/config/defaults.yaml (1 hunks)
- recipes/vit/config/vit_base_patch16_224.yaml (1 hunks)
- recipes/vit/config/vit_te_base_patch16_224.yaml (1 hunks)
- recipes/vit/distributed.py (1 hunks)
- recipes/vit/imagenet_dataset.py (1 hunks)
- recipes/vit/infer.py (1 hunks)
- recipes/vit/test_infer.py (1 hunks)
- recipes/vit/test_train.py (1 hunks)
- recipes/vit/train.py (1 hunks)
- recipes/vit/vit.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- recipes/vit/config/defaults.yaml
- recipes/vit/config/vit_te_base_patch16_224.yaml
- recipes/vit/imagenet_dataset.py
- recipes/vit/config/vit_base_patch16_224.yaml
🧰 Additional context used
🧬 Code graph analysis (4)
recipes/vit/infer.py (3)
recipes/vit/checkpoint.py (1)
- load_torch_checkpoint (27-54)

recipes/vit/distributed.py (1)

- initialize_distributed (23-77)

recipes/vit/vit.py (1)

- build_vit_model (70-98)
recipes/vit/test_infer.py (4)
recipes/vit/checkpoint.py (1)
- save_dcp_checkpoint (118-129)

recipes/vit/distributed.py (1)

- initialize_distributed (23-77)

recipes/vit/infer.py (1)

- main (30-49)

recipes/vit/vit.py (1)

- build_vit_model (70-98)
recipes/vit/train.py (5)
recipes/vit/checkpoint.py (2)
- load_auto_resume_checkpoint (73-115)
- save_auto_resumable_checkpoint (132-163)

recipes/vit/distributed.py (1)

- initialize_distributed (23-77)

recipes/vit/imagenet_dataset.py (2)

- ImageNetDataset (250-337)
- infinite_dataloader (38-45)

recipes/vit/imagenet_utils.py (2)

- transforms_imagenet_eval (946-1033)
- transforms_imagenet_train (775-943)

recipes/vit/vit.py (1)

- build_vit_model (70-98)
recipes/vit/test_train.py (1)
recipes/vit/train.py (1)
main(44-325)
🪛 LanguageTool
recipes/vit/README.md
[grammar] ~108-~108: There might be a mistake here.
Context: ...rt-in-pytorch-distributed-checkpointing/
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
53-53: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
69-69: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
75-75: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
108-108: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (4)
recipes/vit/test_infer.py (1)
51-59: DCP→Torch conversion OK. The staged DCP save and conversion to Torch format looks correct and aligns with the loader’s expectations.
recipes/vit/distributed.py (1)
63-71: Persist flattened sub-meshes to avoid GC
- Capture each flattened sub-mesh instead of discarding it, e.g.:

```python
dp_mesh = device_mesh[("dp_inter", "dp_shard")]._flatten("dp")
dp_cp_shard_mesh = device_mesh[("dp_shard", "cp")]._flatten("dp_cp_shard")
hsdp_mesh = device_mesh[("dp_inter", "dp_shard", "cp")]._flatten("hsdp")
device_mesh._submeshes = {"dp": dp_mesh, "dp_cp_shard": dp_cp_shard_mesh, "hsdp": hsdp_mesh}
```
- Check for a supported public API in place of the private `_flatten` and update to use it if available.

recipes/vit/train.py (1)
131-136: DeviceMesh “dp” sub-mesh is available
`initialize_distributed` flattens `("dp_inter", "dp_shard")` into `"dp"` before yielding the mesh, so `device_mesh["dp"].size()` and `device_mesh["dp"].get_local_rank()` are guaranteed and will not raise a KeyError.

recipes/vit/vit.py (1)
81-99: Ignore meta-device init suggestion: `with torch.device("meta")` is a supported context manager (including for "meta") and will reliably allocate all submodule parameters on the meta device; no change needed.

Likely an incorrect or invalid review comment.
Actionable comments posted: 1
♻️ Duplicate comments (11)
recipes/vit/test_infer.py (3)
28-29: Skip on CPU-only runners to avoid hard CUDA dependency in CI.

```diff
-@pytest.mark.parametrize("config_name", ["vit_base_patch16_224", "vit_te_base_patch16_224"])
+@pytest.mark.parametrize("config_name", ["vit_base_patch16_224", "vit_te_base_patch16_224"])
+@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA")
```
64-65: Prefer an undecorated run_inference(cfg) that returns output; assert shape. Current test only checks no-exception. Refactor infer.py to expose run_inference(cfg)->Tensor and use it here.
```diff
-# Run inference.
-main(vit_config)
+# Run inference.
+from .infer import run_inference  # move import to header
+out = run_inference(vit_config)
+assert out.shape[0] == 1
```
22-25: Fix imports to be package-relative; current form breaks under pytest from repo root. Make recipes/vit a package and use relative imports.
```diff
-from checkpoint import save_dcp_checkpoint
-from distributed import initialize_distributed
-from infer import main
-from vit import build_vit_model
+from .checkpoint import save_dcp_checkpoint
+from .distributed import initialize_distributed
+from .infer import main  # or run_inference (see below)
+from .vit import build_vit_model
```

Also add:
```python
# recipes/vit/__init__.py
"""ViT recipe package."""
```

recipes/vit/checkpoint.py (5)
2-2: Use canonical SPDX identifier "Apache-2.0".

```diff
-# SPDX-License-Identifier: LicenseRef-Apache2
+# SPDX-License-Identifier: Apache-2.0
```
37-45: Don’t use str.strip for prefix removal; it corrupts keys. strip() removes any of the characters, not the exact prefix.
```diff
-    checkpoint = torch.load(checkpoint_path, weights_only=False)
-    # Remove the "module." prefix from the keys of checkpoints
-    # derived from Megatron-FSDP.
-    model_checkpoint = {(k.strip("module.") if megatron_fsdp else k): v for k, v in checkpoint["model"].items()}
-    # Warn about Megatron-FSDP checkpoints.
-    first_key = next(iter(model_checkpoint))
-    if first_key.startswith("module.") and not megatron_fsdp:
+    checkpoint = torch.load(checkpoint_path, weights_only=False)
+    def _rm_prefix(name: str, prefix: str = "module.") -> str:
+        return name[len(prefix):] if name.startswith(prefix) else name
+    # Handle Megatron-FSDP "module." prefix correctly.
+    raw_keys = list(checkpoint["model"].keys())
+    model_checkpoint = {(_rm_prefix(k) if megatron_fsdp else k): v for k, v in checkpoint["model"].items()}
+    # Warn if keys appear prefixed but user forgot megatron_fsdp=True.
+    if (not megatron_fsdp) and any(k.startswith("module.") for k in raw_keys):
         _logger.warning(
-            f"Checkpoint state dictionary keys ({first_key}) may be prefixed "
+            f"Checkpoint state dictionary keys ({raw_keys[0]}) may be prefixed "
             "with 'module.' if converted from a Megatron-FSDP DCP checkpoint."
             "Set megatron_fsdp=True to automatically strip the prefix."
         )
```

(Note the warning now references `raw_keys[0]`, since `first_key` is removed in this version.)
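To see why `strip` corrupts keys, here is a quick standalone check (plain Python, no torch needed; the key name is a typical example):

```python
key = "module.mlp.fc2.weight"

# str.strip removes ANY leading/trailing characters drawn from the set
# {'m','o','d','u','l','e','.'}, so it eats into the real key name:
assert key.strip("module.") == "p.fc2.weight"  # corrupted!

# Exact-prefix removal keeps the key intact (str.removeprefix needs Python 3.9+):
assert key.removeprefix("module.") == "mlp.fc2.weight"
assert "head.weight".removeprefix("module.") == "head.weight"  # no-op when absent
```

After stripping "module.", `strip` keeps consuming the leading "ml" of "mlp" because those characters are also in the strip set, which is exactly the silent key-mangling the suggested `_rm_prefix` helper avoids.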
56-70: Robust DCP load: support None objects and load stateful objects directly. Current code will crash when optimizer is None and unnecessarily materializes state_dicts.
```diff
-def load_dcp_checkpoint(checkpoint_path, model=None, optimizer=None):
+def load_dcp_checkpoint(checkpoint_path, model=None, optimizer=None):
     """Load a Torch DCP checkpoint from checkpoint_path into model and optimizer.

     Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
     """
-    # Load model and optimizer checkpoints.
-    state_dict = {}
-    if model is not None:
-        state_dict["model"] = model.state_dict()
-    if optimizer is not None:
-        state_dict["optimizer"] = optimizer.state_dict()
-    torch.distributed.checkpoint.load(state_dict, checkpoint_id=checkpoint_path)
-    model.load_state_dict(state_dict["model"])
-    optimizer.load_state_dict(state_dict["optimizer"])
+    state = {}
+    if model is not None:
+        state["model"] = model
+    if optimizer is not None:
+        state["optimizer"] = optimizer
+    torch.distributed.checkpoint.load(state, checkpoint_id=checkpoint_path)
```
110-112: Guard rank-0 logging; get_rank() raises if PG isn’t initialized.

```diff
-    if torch.distributed.get_rank() == 0:
+    if torch.distributed.is_available() and torch.distributed.is_initialized() and torch.distributed.get_rank() == 0:
         _logger.info(f"Loaded latest model and optimizer checkpoints from: {latest_subdir}")
```
153-162: Avoid world-writable 0o777 on checkpoints; use 0o775 and guard logging. 0o777 is risky on shared filesystems.
```diff
-    # Relax checkpoint permissions, which may be helpful when saving checkpoints in a container owned by root.
-    mode = 0o777
+    # Relax checkpoint permissions for collaboration; avoid world-writable.
+    mode = 0o775
     for dirpath, _, filenames in os.walk(ckpt_dir):
         # Change current directory perms.
         os.chmod(dirpath, mode)
         for filename in filenames:
             # Change file perms.
             file_path = Path(dirpath) / filename
             os.chmod(file_path, mode)
-    if torch.distributed.get_rank() == 0:
+    if torch.distributed.is_available() and torch.distributed.is_initialized() and torch.distributed.get_rank() == 0:
         _logger.info(f"Saved validated checkpoint to: {ckpt_dir}")
```

recipes/vit/config/vit_base_patch16_224.yaml (3)
76-76: num_classes must match dataset (5); 100000 will break head shape.

```diff
-  num_classes: 100000
+  num_classes: 5
```
46-56: Align FSDP strategy types with defaults (strings, not ints). Hydra/OmegaConf type mismatches will cause parsing/branching issues.
```diff
   init_model_with_meta_device: true
-  zero_dp_strategy: 3
+  zero_dp_strategy: "optim_grads_params"
   fsdp_unit_modules:
     - vit.Block
     - vit.PatchEmbed
     - torch.nn.LayerNorm
     - torch.nn.Linear
   use_hybrid_fsdp: true
-  outer_dp_sharding_strategy: 1
+  outer_dp_sharding_strategy: "optim"
   grad_reduce_in_fp32: false
   preserve_fp32_weights: true
```

If defaults.yaml uses different enum names, update to those exact strings.
80-90: Point dataset paths to bundled data or parameterize a data_root. As-is, example runs will fail to find files.
Option A (absolute within repo):
```diff
-    root: "./data/super-tiny-imagenet-5/train"
-    class_map: "./data/super-tiny-imagenet-5/words.txt"
+    root: "recipes/vit/data/super-tiny-imagenet-5/train"
+    class_map: "recipes/vit/data/super-tiny-imagenet-5/words.txt"
 ...
-    root: "./data/super-tiny-imagenet-5/val"
-    class_map: "./data/super-tiny-imagenet-5/words.txt"
-    label_map: "./data/super-tiny-imagenet-5/val/val_annotations.txt"
+    root: "recipes/vit/data/super-tiny-imagenet-5/val"
+    class_map: "recipes/vit/data/super-tiny-imagenet-5/words.txt"
+    label_map: "recipes/vit/data/super-tiny-imagenet-5/val/val_annotations.txt"
```

Option B (preferred: configurable root):

```diff
+dataset:
+  data_root: "recipes/vit/data/super-tiny-imagenet-5"
   train:
-    root: "./data/super-tiny-imagenet-5/train"
-    class_map: "./data/super-tiny-imagenet-5/words.txt"
+    root: "${dataset.data_root}/train"
+    class_map: "${dataset.data_root}/words.txt"
   val:
-    root: "./data/super-tiny-imagenet-5/val"
-    class_map: "./data/super-tiny-imagenet-5/words.txt"
-    label_map: "./data/super-tiny-imagenet-5/val/val_annotations.txt"
+    root: "${dataset.data_root}/val"
+    class_map: "${dataset.data_root}/words.txt"
+    label_map: "${dataset.data_root}/val/val_annotations.txt"
```
🧹 Nitpick comments (4)
recipes/vit/test_infer.py (2)
18-21: Import torch if you gate the test on CUDA. Required for skip marker below.
```diff
 import pytest
 from hydra import compose, initialize_config_dir
 from torch.distributed.checkpoint.format_utils import dcp_to_torch_save
+import torch
```
62-63: Assert checkpoint conversion succeeded for a minimal correctness signal.dcp_to_torch_save(Path(tmp_path) / "test_infer_dcp_checkpoint", test_ckpt_path) +assert test_ckpt_path.exists()recipes/vit/checkpoint.py (1)
117-129: Optional: Save stateful objects directly with DCP. Keeps save/load symmetric and lets sharded state handle itself.
```diff
-    state_dict = {}
-    if model is not None:
-        state_dict["model"] = model.state_dict()
-    if optimizer is not None:
-        state_dict["optimizer"] = optimizer.state_dict()
-    torch.distributed.checkpoint.save(state_dict, checkpoint_id=checkpoint_path)
+    state = {}
+    if model is not None:
+        state["model"] = model
+    if optimizer is not None:
+        state["optimizer"] = optimizer
+    torch.distributed.checkpoint.save(state, checkpoint_id=checkpoint_path)
```

recipes/vit/config/vit_base_patch16_224.yaml (1)
67-75: Default inference flag megatron_fsdp: consider false for quickstart. Most new users will run non-Megatron checkpoints first; setting true by default can confuse.
format: "torch" - megatron_fsdp: true + megatron_fsdp: false
📜 Review details
Configuration used: CodeRabbit UI
📒 Files selected for processing (6)
- recipes/vit/checkpoint.py (1 hunks)
- recipes/vit/config/defaults.yaml (1 hunks)
- recipes/vit/config/vit_base_patch16_224.yaml (1 hunks)
- recipes/vit/config/vit_te_base_patch16_224.yaml (1 hunks)
- recipes/vit/infer.py (1 hunks)
- recipes/vit/test_infer.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- recipes/vit/infer.py
- recipes/vit/config/vit_te_base_patch16_224.yaml
- recipes/vit/config/defaults.yaml
🧰 Additional context used
🧬 Code graph analysis (1)
recipes/vit/test_infer.py (4)
recipes/vit/checkpoint.py (1)
- save_dcp_checkpoint (117-128)

recipes/vit/distributed.py (1)

- initialize_distributed (23-77)

recipes/vit/infer.py (1)

- main (30-51)

recipes/vit/vit.py (1)

- build_vit_model (70-98)
build_vit_model(70-98)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
Force-pushed from 24e0a65 to 0fffd7b
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1103      +/-   ##
==========================================
- Coverage   79.57%   79.57%   -0.01%
==========================================
  Files         160      160
  Lines       11826    11826
==========================================
- Hits         9411     9410       -1
- Misses       2415     2416       +1
```
Force-pushed from 43430ee to c09c4bb
Actionable comments posted: 5
♻️ Duplicate comments (8)
recipes/vit/checkpoint.py (2)
57-73: Simplify DCP checkpoint loading by passing stateful objects directly. The current implementation unnecessarily creates intermediate state dicts. DCP supports loading directly into stateful objects.
Apply this diff:
```diff
 def load_dcp_checkpoint(checkpoint_path, model=None, optimizer=None):
     """Load a Torch DCP checkpoint from checkpoint_path into model and optimizer.

     Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
     """
-    # Load model and optimizer checkpoints.
-    state_dict = {}
-    if model is not None:
-        state_dict["model"] = model.state_dict()
-    if optimizer is not None:
-        state_dict["optimizer"] = optimizer.state_dict()
-    torch.distributed.checkpoint.load(state_dict, checkpoint_id=checkpoint_path)
-    if model is not None:
-        model.load_state_dict(state_dict["model"])
-    if optimizer is not None:
-        optimizer.load_state_dict(state_dict["optimizer"])
+    state = {}
+    if model is not None:
+        state["model"] = model
+    if optimizer is not None:
+        state["optimizer"] = optimizer
+    if state:  # Only load if there's something to load
+        torch.distributed.checkpoint.load(state, checkpoint_id=checkpoint_path)
```
156-163: Overly permissive 0777 permissions on checkpoints. World-writable permissions (0777) are a security risk on shared filesystems. Consider using more restrictive permissions.
Apply this diff:
```diff
-    # Relax checkpoint permissions, which may be helpful when saving checkpoints in a container owned by root.
-    mode = 0o777
+    # Relax checkpoint permissions for collaboration; avoid world-writable.
+    mode = 0o775
     for dirpath, _, filenames in os.walk(ckpt_dir):
         # Change current directory perms.
         os.chmod(dirpath, mode)
         for filename in filenames:
             # Change file perms.
             file_path = Path(dirpath) / filename
             os.chmod(file_path, mode)
```

recipes/vit/distributed.py (1)
45-51: Initialize with explicit backend and fix LOCAL_RANK fallback. The current implementation has several issues:
- Backend should be explicit (NCCL for CUDA)
- Fallback to global rank for LOCAL_RANK is incorrect on multi-node setups
- Missing CUDA availability check and device bounds validation
Apply this diff:
```diff
-    torch.distributed.init_process_group()
+    if not torch.cuda.is_available():
+        raise RuntimeError("CUDA is required for this recipe; no GPU visible.")
+    torch.distributed.init_process_group(backend="nccl", init_method="env://")
-    local_rank = int(os.getenv("LOCAL_RANK", torch.distributed.get_rank()))
-    torch.cuda.set_device(local_rank)
+    local_rank = int(os.getenv("LOCAL_RANK", -1))
+    if local_rank < 0:
+        world_rank = torch.distributed.get_rank()
+        local_rank = world_rank % torch.cuda.device_count()
+    assert 0 <= local_rank < torch.cuda.device_count(), (
+        f"Computed local_rank {local_rank} out of range for {torch.cuda.device_count()} visible devices."
+    )
+    torch.cuda.set_device(local_rank)
```

recipes/vit/vit.py (2)
1556-1556: Positional embedding calls need null guards.

Previous reviews correctly identified that `self.pos_embed(x)` can fail when `pos_embed=False`, since `self.pos_embed` would be `None`.
482-486: Buffer registration issue with qkv_bias parameter.The code registers
Noneas a buffer whenqkv_bias=True, which is incorrect. Buffers should be tensor objects.recipes/vit/README.md (2)
26-31: Remove incorrect PIP_CONSTRAINT advice; show venv-based install.

````diff
-```
-cd recipes/vit
-# If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
-# We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
-pip install -r requirements.txt
-```
+```bash
+cd recipes/vit
+# Recommended: create and activate a clean virtual environment (or use the provided Docker image)
+python -m venv .venv && source .venv/bin/activate
+pip install -r requirements.txt
+```
````
41-41: Fix DCP docs host to pytorch.org (not docs.pytorch.org).

```diff
-... save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) ...
+... save auto-resumable [PyTorch Distributed Checkpoint (DCP)](https://pytorch.org/docs/stable/distributed.checkpoint.html) ...
```

```diff
-  Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
+  Docs: https://pytorch.org/docs/stable/distributed.checkpoint.html
```

Also applies to: 62-63, 76-77
recipes/vit/train.py (1)
183-187: Resume data iteration correctly: `enumerate(start=...)` doesn't skip items.

Current code reprocesses batches after resume. Use `itertools.islice` to drop the first N items.
```diff
-        for batch_idx, sample in enumerate(
-            # Skip to latest step.
-            infinite_dataloader(train_dataloader, train_sampler),
-            start=latest_step_idx,
-        ):
+        import itertools
+        # Skip already-processed steps to avoid duplication on resume.
+        train_iter = itertools.islice(
+            infinite_dataloader(train_dataloader, train_sampler),
+            latest_step_idx,
+            None,
+        )
+        for batch_idx, sample in enumerate(train_iter, start=latest_step_idx):
```
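To see why `islice` is needed, this small stdlib-only example contrasts the two approaches on a plain `range` standing in for the dataloader:

```python
import itertools

# enumerate(..., start=3) only relabels indices; items 0-2 are re-yielded.
relabeled = list(itertools.islice(enumerate(iter(range(10)), start=3), 3))
print(relabeled)  # -> [(3, 0), (4, 1), (5, 2)]

# islice actually discards the first 3 items, so iteration resumes at item 3.
resumed = list(
    itertools.islice(enumerate(itertools.islice(iter(range(10)), 3, None), start=3), 3)
)
print(resumed)  # -> [(3, 3), (4, 4), (5, 5)]
```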
🧹 Nitpick comments (11)
recipes/vit/distributed.py (2)
56-59: Fix validation error message to exclude tp from product

The error message incorrectly includes `tp` in the product calculation, but the validation only checks `dp_outer * dp_shard * cp`.

Apply this diff:

```diff
-    f"Invalid parallelism sizes: dp_outer({dp_outer}) * dp_shard({dp_shard}) * cp({cp}) * tp({tp}) != world_size({torch.distributed.get_world_size()})"
+    f"Invalid parallelism sizes: dp_outer({dp_outer}) * dp_shard({dp_shard}) * cp({cp}) != world_size({torch.distributed.get_world_size()}) [tp={tp} is separate]"
```
74-79: Consider adding type hints for better code clarity

Adding type hints would improve code maintainability and IDE support. Consider adding type hints to the function signature:

```diff
+from typing import Generator
+from torch.distributed.device_mesh import DeviceMesh
+
 @contextmanager
 def initialize_distributed(
     dp_outer: int = 1,
     dp_shard: int = 1,
     cp: int = 1,
     tp: int = 1,
-):
+) -> Generator[DeviceMesh, None, None]:
```

recipes/vit/checkpoint.py (1)
99-110: Potential IndexError when parsing checkpoint directory names

The code assumes checkpoint directories follow the exact format `step_<idx>_loss_<value>` but doesn't validate this, which could cause IndexErrors if directories don't match the expected format.

Add validation for directory name format:

```diff
         # We expect a checkpoint named as: step_<step_idx>_loss_<loss_value>.
         # Get the latest step, the directory with the most recent modification time.
         opt_metric_coeff = 1 if cfg.training.checkpoint.resume_from_metric == "+" else -1
+
+        def parse_checkpoint_dir(path):
+            """Parse checkpoint directory name, return None if invalid format."""
+            try:
+                parts = path.name.split("_")
+                if len(parts) >= 4 and parts[0] == "step" and parts[2] == "loss":
+                    return int(parts[1]), float(parts[3])
+            except (ValueError, IndexError):
+                pass
+            return None
+
+        valid_subdirs = [(d, parse_checkpoint_dir(d)) for d in subdirs]
+        valid_subdirs = [(d, parsed) for d, parsed in valid_subdirs if parsed is not None]
+
+        if not valid_subdirs:
+            _logger.warning("No valid checkpoint directories found")
+            return 0
+
         latest_subdir = max(
-            subdirs,
+            valid_subdirs,
             key=lambda x: (
-                opt_metric_coeff * float(x.name.split("_")[3])
+                opt_metric_coeff * x[1][1]
                 if cfg.training.checkpoint.resume_from_metric
                 else 0,
-                x.stat().st_mtime,
+                x[0].stat().st_mtime,
             ),
-        )
+        )[0]
```

recipes/vit/config/defaults.yaml (1)
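A standalone sketch of the defensive parsing idea, using a hypothetical `parse_checkpoint_name` helper that works on plain strings rather than `Path` objects:

```python
def parse_checkpoint_name(name):
    """Return (step_idx, loss) for 'step_<idx>_loss_<value>' names, else None."""
    parts = name.split("_")
    try:
        if len(parts) >= 4 and parts[0] == "step" and parts[2] == "loss":
            return int(parts[1]), float(parts[3])
    except (ValueError, IndexError):
        # Malformed step index or loss value: treat as not-a-checkpoint.
        pass
    return None


print(parse_checkpoint_name("step_75_loss_1.725"))  # -> (75, 1.725)
print(parse_checkpoint_name("latest"))              # -> None
print(parse_checkpoint_name("step_x_loss_0.1"))     # -> None
```

Stray directories (editor backups, partially written checkpoints) then simply parse to `None` instead of crashing the resume logic.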
50-50: Consider documenting FSDP zero_dp_strategy options

The `zero_dp_strategy: "optim_grads_params"` setting affects memory usage significantly. Consider adding a comment explaining the available options. Add documentation for clarity:

```diff
 init_model_with_meta_device: true
-zero_dp_strategy: "optim_grads_params"
+zero_dp_strategy: "optim_grads_params"  # Options: "none", "optim", "grads", "optim_grads", "params", "optim_params", "grads_params", "optim_grads_params"
```

recipes/vit/README.md (2)
5-5: Heading level jump (MD001).

Promote "Pre-Requisites" to H2 so levels increment by one.

```diff
-### Pre-Requisites
+## Prerequisites
```
11-13: Add languages to fenced code blocks (MD040).

Specify bash/python for better rendering and copy/paste UX.

````diff
-```
+```bash
 docker build -t <image_repo>:<image_tag> .
 ```

-```
+```bash
 # Utilize plenty of shared memory (--shm-size) to support loading large batches of image data!
 docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
 ```

-```
+```bash
 torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
 ```

-```
+```text
 python -m torch.distributed.checkpoint.format_utils --help
 usage: format_utils.py [-h] {torch_to_dcp,dcp_to_torch} src dst
 ```
@@
-```
+```bash
 python -m torch.distributed.checkpoint.format_utils dcp_to_torch step_75_loss_1.725 torch_ckpt_test.pt
 ```

 ```python
 from torch.distributed.checkpoint.format_utils import (
     dcp_to_torch_save,
     torch_save_to_dcp,
 )
 ```
@@
 ```python
 from checkpoint import load_torch_checkpoint
 ```
````

Also applies to: 17-20, 37-39, 105-117, 121-138, 145-157

recipes/vit/train.py (3)

127-136: Enable pinned memory for faster H2D; guard persistent_workers (already done).

Pin host memory and use non_blocking copies below.

```diff
 train_dataloader = DataLoader(
     beans_train_dataset,
     batch_size=cfg.dataset.train.batch_size,
     sampler=train_sampler,
     num_workers=cfg.dataset.num_workers,
@@
-    persistent_workers=(cfg.dataset.num_workers > 0),
+    persistent_workers=(cfg.dataset.num_workers > 0),
+    pin_memory=True,
 )
@@
 val_dataloader = DataLoader(
     beans_val_dataset,
     batch_size=cfg.dataset.val.batch_size,
     sampler=val_sampler,
     num_workers=cfg.dataset.num_workers,
@@
-    persistent_workers=(cfg.dataset.num_workers > 0),
+    persistent_workers=(cfg.dataset.num_workers > 0),
+    pin_memory=True,
 )
```

Also update the H2D transfers:
```diff
-input = input.cuda()
-target = target.cuda()
+input = input.cuda(non_blocking=True)
+target = target.cuda(non_blocking=True)
```

Also applies to: 149-158
193-201: Zero gradients with `set_to_none=True`.

Better perf and avoids residual grads.

```diff
-optimizer.zero_grad()
+optimizer.zero_grad(set_to_none=True)
```
254-255: Optional: synchronize before timing for stable throughput.

Ensures CUDA kernels complete before measuring.

```diff
-t_end = time.perf_counter()
+torch.cuda.synchronize()
+t_end = time.perf_counter()
```

recipes/vit/beans.py (1)
27-35: `infinite_dataloader` is fine; consider exposing a start offset.

If other callers need resume-on-step, optionally accept a `start` parameter and skip via `itertools.islice`.

```diff
-def infinite_dataloader(dataloader, sampler):
+def infinite_dataloader(dataloader, sampler, *, start: int = 0):
@@
     while True:
         sampler.set_epoch(epoch)  # Update epoch for proper shuffling
-        for batch in dataloader:
+        for i, batch in enumerate(dataloader):
+            if epoch == 0 and i < start:
+                continue
             yield batch
```

recipes/vit/config/vit_base_patch16_224.yaml (1)
68-76: Inference defaults: consider `megatron_fsdp: false` when `format: "torch"`.

Safer out of the box, since non-distributed torch checkpoints don't require Megatron-FSDP.

```diff
 checkpoint:
   path: null
   # Load a DCP->Torch converted checkpoint for inference without Megatron-FSDP.
   # Otherwise, set this to "torch_dcp" if using Megatron-FSDP for inference.
   # If the checkpoint was not trained with Megatron-FSDP, then set megatron_fsdp to false.
   format: "torch"
-  megatron_fsdp: true
+  megatron_fsdp: false
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- recipes/vit/.gitignore (1 hunks)
- recipes/vit/.ruff.toml (1 hunks)
- recipes/vit/Dockerfile (1 hunks)
- recipes/vit/LICENSE (1 hunks)
- recipes/vit/README.md (1 hunks)
- recipes/vit/beans.py (1 hunks)
- recipes/vit/checkpoint.py (1 hunks)
- recipes/vit/config/defaults.yaml (1 hunks)
- recipes/vit/config/vit_base_patch16_224.yaml (1 hunks)
- recipes/vit/config/vit_te_base_patch16_224.yaml (1 hunks)
- recipes/vit/distributed.py (1 hunks)
- recipes/vit/infer.py (1 hunks)
- recipes/vit/requirements.txt (1 hunks)
- recipes/vit/test_infer.py (1 hunks)
- recipes/vit/test_train.py (1 hunks)
- recipes/vit/train.py (1 hunks)
- recipes/vit/vit.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- recipes/vit/.gitignore
🚧 Files skipped from review as they are similar to previous changes (6)
- recipes/vit/.ruff.toml
- recipes/vit/Dockerfile
- recipes/vit/infer.py
- recipes/vit/test_infer.py
- recipes/vit/config/vit_te_base_patch16_224.yaml
- recipes/vit/test_train.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-09T23:57:34.120Z
Learnt from: pstjohn
PR: NVIDIA/bionemo-framework#1139
File: models/geneformer/src/geneformer/convert.py:1-3
Timestamp: 2025-09-09T23:57:34.120Z
Learning: NVIDIA's legal team has specified using "LicenseRef-Apache2" as the SPDX license identifier instead of the standard "Apache-2.0" identifier in their codebase header files.
Applied to files:
recipes/vit/checkpoint.py
📚 Learning: 2025-08-28T16:40:04.315Z
Learnt from: pstjohn
PR: NVIDIA/bionemo-framework#1078
File: recipes/esm2_native_te_mfsdp/train_ddp.py:103-108
Timestamp: 2025-08-28T16:40:04.315Z
Learning: PyTorch DistributedDataParallel constructor accepts a device_mesh parameter in recent versions, which supports advanced distributed training scenarios and nvFSDP configurations.
Applied to files:
recipes/vit/distributed.py
🧬 Code graph analysis (1)
recipes/vit/train.py (4)
recipes/vit/beans.py (2)
- BeansDataset (37-65)
- infinite_dataloader (27-34)

recipes/vit/checkpoint.py (2)
- load_auto_resume_checkpoint (75-117)
- save_auto_resumable_checkpoint (134-165)

recipes/vit/distributed.py (1)
- initialize_distributed (23-85)

recipes/vit/vit.py (1)
- build_vit_model (77-105)
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
105-105: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
121-121: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
159-159: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests (recipes/vit, svcbionemo023/bionemo-framework:pytorch25.06-py3-squashed-zstd)
🔇 Additional comments (12)
recipes/vit/LICENSE (1)
1-224: LGTM! Dual licensing properly documented

The dual licensing (Apache 2.0 and MIT) is clearly presented with appropriate copyright attributions for both Ross Wightman (2019) and AIR Lab Makerere University (2020).
recipes/vit/checkpoint.py (3)
2-2: Using non-standard SPDX license identifier

Based on the retrieved learning, NVIDIA's legal team has specified using "LicenseRef-Apache2" as the SPDX license identifier in their codebase, so this is correct for this repository.
43-43: LGTM! Using removeprefix correctly

The code correctly uses `removeprefix("module.")`, which only removes the exact prefix if present, avoiding the issues that would arise from using `strip()`.
113-114: LGTM! Proper rank-0 logging guards

The code correctly checks if the distributed process group is initialized before calling `get_rank()`, preventing potential runtime errors.

Also applies to: 164-165
recipes/vit/config/defaults.yaml (2)
1-92: LGTM! Clean configuration structure with dataset-driven class count

The configuration properly uses `${dataset.num_classes}` to drive the model's output dimension from the dataset configuration, avoiding hardcoded values and ensuring consistency.
77-77: Consider whether num_classes=3 is the intended default

The default `num_classes: 3` seems unusual. Most standard datasets have different class counts (e.g., ImageNet has 1000, CIFAR-10 has 10). Verify if this is intentional or should be adjusted. Is `num_classes: 3` the correct default for the super-tiny-imagenet dataset mentioned in the PR description, or should this be 5 to match the "5 classes × 50 training images" mentioned?

recipes/vit/vit.py (3)
69-75: TransformerEngine import handling is properly defensive.

The try/except block correctly handles the optional TransformerEngine dependency and sets the appropriate flag for conditional usage throughout the code.
77-105: Model factory function handles edge cases well.

The `build_vit_model` function properly handles meta device initialization, TransformerEngine configuration, and channels_last memory format. The assertions and conditional logic are appropriate.
1264-1306: VisionTransformer constructor parameter validation is thorough.

The assertions for global_pool values, class_token requirements, and dimension divisibility provide good validation. The parameter documentation is comprehensive.
recipes/vit/README.md (1)
3-3: Fix TIMM capitalization.

Use "TIMM" (or "timm"), not "TImM".

```diff
-_Adapted ViT model code from huggingface/pytorch-image-models (TImM) written by Ross Wightman (@rwightman / Copyright 2020), which you can check out here: https://github.com/huggingface/pytorch-image-models_
+_Adapted ViT model code from huggingface/pytorch-image-models (TIMM) written by Ross Wightman (@rwightman / Copyright 2020), which you can check out here: https://github.com/huggingface/pytorch-image-models_
```

recipes/vit/beans.py (1)
42-55: Rank-0 logging looks good.

Init and metadata logging are correct and safe under DDP.
recipes/vit/config/vit_base_patch16_224.yaml (1)
77-86: num_classes wiring looks correct.

Model uses `${dataset.num_classes}`; `dataset.num_classes=3` matches Beans.
Also applies to: 10-10
Actionable comments posted: 0
♻️ Duplicate comments (7)
recipes/vit/test_infer.py (2)
28-38: Skip on CPU-only runners or support CPU.

The test forces CUDA usage, which will cause failures on CPU-only runners.
Apply this diff to guard the test:
```diff
+import torch
+
 @pytest.mark.parametrize("config_name", ["vit_base_patch16_224", "vit_te_base_patch16_224"])
+@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA")
 def test_infer(monkeypatch, tmp_path, config_name):
```
22-25: Imports rely on current working directory; make them package-relative.

The imports use bare module names, which will break when pytest runs from the project root. Additionally, the import from `infer` pulls in `main`, which is decorated with `@hydra.main` and expects zero args.

Apply this diff to fix the import and calling issues:

```diff
-from checkpoint import save_dcp_checkpoint
-from distributed import initialize_distributed
-from infer import main
-from vit import build_vit_model
+from .checkpoint import save_dcp_checkpoint
+from .distributed import initialize_distributed
+from .infer import run_inference  # assuming run_inference is extracted from main
+from .vit import build_vit_model
```

Also, Line 65 calls `main(vit_config)`, but `main` is decorated with `@hydra.main` and expects zero arguments. The inference logic should be extracted into an undecorated `run_inference(cfg)` function.

Check if the `run_inference` function exists or if `infer.py` needs refactoring:

```bash
#!/bin/bash
# Check if run_inference function exists in infer.py
rg -n "def run_inference" recipes/vit/infer.py || echo "run_inference function not found"
rg -n "@hydra.main" recipes/vit/infer.py -A 3
```

recipes/vit/README.md (2)
26-31: Remove incorrect PIP_CONSTRAINT guidance; recommend venv instead.

The guidance suggesting `PIP_CONSTRAINT=` to ignore dependency conflicts is incorrect. `PIP_CONSTRAINT` requires a constraints file path, and setting it empty doesn't "ignore conflicts."

Apply this diff to fix the installation guidance:

````diff
-```
+```bash
 cd recipes/vit
-# If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
-# We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
+# Recommended: create and activate a clean virtual environment (or use the provided Docker image)
+python -m venv .venv && source .venv/bin/activate
 pip install -r requirements.txt
 ```
````

41-41: Fix PyTorch DCP docs URL (domain is pytorch.org, not docs.pytorch.org).

The PyTorch documentation links point to a non-canonical host. Apply this diff to fix the URLs throughout the document:

```diff
-which will train on the [`AI-Lab-Makerere/ibean`](https://github.com/AI-Lab-Makerere/ibean/) (HuggingFace: [`AI-Lab-Makerere/beans`](https://huggingface.co/datasets/AI-Lab-Makerere/beans)) dataset and save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
+which will train on the [`AI-Lab-Makerere/ibean`](https://github.com/AI-Lab-Makerere/ibean/) (HuggingFace: [`AI-Lab-Makerere/beans`](https://huggingface.co/datasets/AI-Lab-Makerere/beans)) dataset and save auto-resumable [PyTorch Distributed Checkpoint (DCP)](https://pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
```

Also fix at Lines 62-63 and 76-77:

```diff
-  Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
+  Docs: https://pytorch.org/docs/stable/distributed.checkpoint.html
```

recipes/vit/train.py (1)
183-187: Fix dataloader resume logic to avoid duplicating/skipping samples.

Passing `start=latest_step_idx` to `enumerate` only shifts the batch index; it doesn't advance the iterator. This means that, when resuming from a checkpoint, the model will re-process samples that were already seen before the checkpoint.

The current implementation can lead to:
- Data duplication when resuming training
- Non-deterministic training results
- Incorrect learning curves
Apply this diff to fix the dataloader resume logic:
```diff
+    resume_epoch = latest_step_idx // steps_per_epoch if steps_per_epoch > 0 else 0
+    train_sampler.set_epoch(resume_epoch)
+
+    # Create iterator and skip already processed batches
+    dataloader_iter = iter(infinite_dataloader(train_dataloader, train_sampler))
+    for _ in range(latest_step_idx):
+        next(dataloader_iter, None)
+
     for batch_idx, sample in enumerate(
-        # Skip to latest step.
-        infinite_dataloader(train_dataloader, train_sampler),
+        dataloader_iter,
         start=latest_step_idx,
     ):
```
18-21: Guard runtime deps (CUDA/TE) and prep for ephemeral port use: add imports.

To support the skip guards and a free port, add these imports.

```diff
 from pathlib import Path

+import importlib.util
+import socket
+import torch
 import pytest
 from hydra import compose, initialize_config_dir
```
30-35: Skip when CUDA/TE unavailable and avoid hardcoded port.

Prevents spurious CI failures and port collisions on shared runners.

```diff
-    # Set required environment variables for distributed training
+    # Runtime guards
+    if not torch.cuda.is_available():
+        pytest.skip("CUDA is required for this test.")
+    if config_name.startswith("vit_te_") and importlib.util.find_spec("transformer_engine") is None:
+        pytest.skip("TransformerEngine not installed; skipping TE config.")
+
+    # Set required environment variables for distributed training
     monkeypatch.setenv("LOCAL_RANK", "0")
     monkeypatch.setenv("RANK", "0")
     monkeypatch.setenv("WORLD_SIZE", "1")
-    monkeypatch.setenv("MASTER_ADDR", "localhost")
-    monkeypatch.setenv("MASTER_PORT", "29500")
+    monkeypatch.setenv("MASTER_ADDR", "127.0.0.1")
+    with socket.socket() as s:
+        s.bind(("127.0.0.1", 0))
+        monkeypatch.setenv("MASTER_PORT", str(s.getsockname()[1]))
```
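The ephemeral-port trick in the diff above can be factored into a small stdlib helper; `find_free_port` is a hypothetical name used here for illustration:

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Bind to port 0 so the OS assigns a free ephemeral port, then release it."""
    with socket.socket() as s:
        s.bind((host, 0))
        return s.getsockname()[1]


port = find_free_port()
print(port)  # an OS-assigned port, e.g. in the ephemeral range
```

Note there is a small race window between releasing the socket and the trainer rebinding the port, which is generally acceptable for tests but worth knowing about.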
🧹 Nitpick comments (5)
recipes/vit/README.md (1)
11-20: Add language specification to fenced code blocks.

The markdown linter indicates several fenced code blocks are missing language specifications.

Apply this diff to add language specifications:

````diff
-```
+```bash
 docker build -t <image_repo>:<image_tag> .
 ```

 To launch a Docker container from the image, run the following command:

-```
+```bash
 # Utilize plenty of shared memory (--shm-size) to support loading large batches of image data!
 docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
 ```
````

Also apply to Lines 37-37, 105-105, and 121-121 by adding the `bash` language specification.

recipes/vit/test_train.py (4)
41-52: Make dataloading hermetic for tests (no worker procs).

Avoids persistent_workers and multiprocessing overhead/flakiness in CI.

```diff
 overrides=[
     "++training.steps=5",
     "++training.val_interval=5",
     "++training.log_interval=1",
     f"++training.checkpoint.path={training_ckpt_path}",
     "++profiling.torch_memory_profile=false",
     "++profiling.wandb=false",
     f"++fsdp.init_model_with_meta_device={init_model_with_meta_device}",
+    "++dataset.num_workers=0",
 ],
```
56-63: Loosen checkpoint assertion and verify auto-resume is idempotent.

Exact directory count can be config/impl-dependent; assert ≥1 and ensure the second run doesn't create extras.

```diff
-    # Verify checkpoints were created.
-    assert sum(1 for item in training_ckpt_path.iterdir() if item.is_dir()) == 1, (
-        "Expected 1 checkpoint with 5 training steps and validation interval of 5."
-    )
-
-    # Auto-resume training from checkpoint. For this test, we auto-resume from the best checkpoint.
-    main(vit_config)
+    # Verify checkpoints were created.
+    num_ckpts_before = sum(1 for item in training_ckpt_path.iterdir() if item.is_dir())
+    assert num_ckpts_before >= 1, f"Expected at least 1 checkpoint; found {num_ckpts_before}."
+
+    # Auto-resume training from checkpoint. For this test, resume from the best/latest checkpoint.
+    main(vit_config)
+
+    # Ensure auto-resume doesn't create additional checkpoint directories.
+    num_ckpts_after = sum(1 for item in training_ckpt_path.iterdir() if item.is_dir())
+    assert num_ckpts_after == num_ckpts_before, "Auto-resume should not create extra checkpoint dirs."
```
21-21: Import path robustness for train.main.

Depending on how pytest sets sys.path, `from train import main` can resolve the wrong module. Consider a more explicit import (package path) or file-scoped import via importlib. Optional if CI is stable.

```diff
-from train import main
+# Option A (package import): from recipes.vit.train import main
+# Option B (file import inside test):
+# spec = importlib.util.spec_from_file_location("vit_train", Path(__file__).parent / "train.py")
+# mod = importlib.util.module_from_spec(spec); spec.loader.exec_module(mod); main = mod.main
```
1-63: Optional: add a smoke 2-GPU test gated by availability.

Helps catch regressions in DP=2 without burdening CI on 1-GPU runners.

```python
@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="Requires >=2 GPUs")
def test_train_dp2_smoke(tmp_path, monkeypatch):
    monkeypatch.setenv("LOCAL_RANK", "0")
    monkeypatch.setenv("RANK", "0")
    monkeypatch.setenv("WORLD_SIZE", "2")
    monkeypatch.setenv("MASTER_ADDR", "127.0.0.1")
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        monkeypatch.setenv("MASTER_PORT", str(s.getsockname()[1]))
    # Compose config with dp_shard=2 and very small work.
    # (Reuse compose/initialize_config_dir as above, set steps=2, val_interval=2, num_workers=0.)
    # Call main(cfg) and assert it completes without exception.
```

Want me to push a patch adding this guarded smoke test?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- recipes/vit/.gitignore (1 hunks)
- recipes/vit/.ruff.toml (1 hunks)
- recipes/vit/Dockerfile (1 hunks)
- recipes/vit/LICENSE (1 hunks)
- recipes/vit/README.md (1 hunks)
- recipes/vit/beans.py (1 hunks)
- recipes/vit/checkpoint.py (1 hunks)
- recipes/vit/config/defaults.yaml (1 hunks)
- recipes/vit/config/vit_base_patch16_224.yaml (1 hunks)
- recipes/vit/config/vit_te_base_patch16_224.yaml (1 hunks)
- recipes/vit/distributed.py (1 hunks)
- recipes/vit/infer.py (1 hunks)
- recipes/vit/requirements.txt (1 hunks)
- recipes/vit/test_infer.py (1 hunks)
- recipes/vit/test_train.py (1 hunks)
- recipes/vit/train.py (1 hunks)
- recipes/vit/vit.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
- recipes/vit/requirements.txt
- recipes/vit/config/defaults.yaml
- recipes/vit/.ruff.toml
- recipes/vit/beans.py
- recipes/vit/LICENSE
- recipes/vit/Dockerfile
- recipes/vit/config/vit_te_base_patch16_224.yaml
- recipes/vit/distributed.py
- recipes/vit/config/vit_base_patch16_224.yaml
- recipes/vit/.gitignore
- recipes/vit/infer.py
- recipes/vit/vit.py
- recipes/vit/checkpoint.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-09T23:57:34.120Z
Learnt from: pstjohn
PR: NVIDIA/bionemo-framework#1139
File: models/geneformer/src/geneformer/convert.py:1-3
Timestamp: 2025-09-09T23:57:34.120Z
Learning: NVIDIA's legal team has specified using "LicenseRef-Apache2" as the SPDX license identifier instead of the standard "Apache-2.0" identifier in their codebase header files.
Applied to files:
recipes/vit/test_infer.py
🧬 Code graph analysis (3)
recipes/vit/test_infer.py (4)
recipes/vit/checkpoint.py (1)
- save_dcp_checkpoint (120-131)

recipes/vit/distributed.py (1)
- initialize_distributed (23-85)

recipes/vit/infer.py (1)
- main (30-51)

recipes/vit/vit.py (1)
- build_vit_model (77-105)

recipes/vit/test_train.py (1)

recipes/vit/train.py (1)
- main (42-307)

recipes/vit/train.py (4)

recipes/vit/beans.py (2)
- BeansDataset (37-65)
- infinite_dataloader (27-34)

recipes/vit/checkpoint.py (2)
- load_auto_resume_checkpoint (75-117)
- save_auto_resumable_checkpoint (134-165)

recipes/vit/distributed.py (1)
- initialize_distributed (23-85)

recipes/vit/vit.py (1)
- build_vit_model (77-105)
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
105-105: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
121-121: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
159-159: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests (recipes/vit, svcbionemo023/bionemo-framework:pytorch25.06-py3-squashed-zstd)
🔇 Additional comments (7)
recipes/vit/train.py (3)
41-112: LGTM!

The training setup is comprehensive and well-configured:
- Proper Hydra integration with clear configuration
- Comprehensive profiling setup (memory profiling and WandB)
- Correct device mesh initialization and FSDP configuration
- Proper auto-resume checkpoint loading
The detailed comments explaining the FSDP parameters are particularly helpful.
218-248: LGTM!

The validation logic is well-implemented:
- Proper model evaluation mode setting
- Distributed loss reduction across the HSDP group
- Appropriate checkpoint saving after validation
- Clean validation loop structure
294-307: LGTM!

The cleanup logic is appropriate:
- Proper memory profiler snapshot dumping with TODO for modernization
- Correct WandB finalization on main process only
- Clean resource management
recipes/vit/test_infer.py (2)
2-2: Use NVIDIA-approved SPDX license identifier.

The retrieved learning confirms that NVIDIA's legal team has specified using "LicenseRef-Apache2" as the SPDX license identifier in their codebase header files. This is correct as implemented.
55-62: LGTM!

The checkpoint creation and conversion logic is well-structured:
- Proper distributed context management
- Correct model initialization with device mesh
- Clean DCP to torch checkpoint conversion
- Appropriate use of temporary paths
recipes/vit/README.md (1)
1-4: LGTM!

The header section is well-structured with proper attribution to the original TIMM work by Ross Wightman and clear acknowledgment of the adaptation.
recipes/vit/test_train.py (1)
24-26: Nice parametrization and coverage of meta-device init.

Good breadth for the resume path and config variants.
Actionable comments posted: 3
🧹 Nitpick comments (5)
.github/workflows/unit-tests-recipes.yml (1)
134-134: Parameterize shm size for future tuning.

Minor: make 16G configurable without editing YAML.

```diff
 unit-tests:
   needs: changed-dirs
   runs-on: linux-amd64-gpu-l4-latest-1
@@
-  container:
+  env:
+    CI_SHM_SIZE: 16G
+  container:
     image: ${{ matrix.recipe.image }}
-    options: --shm-size=16G
+    options: --shm-size=${{ env.CI_SHM_SIZE }}
```

LICENSE/third_party.txt (4)
874-875: Standardize "License Text:" URL + add upstream commit references.

For consistency and auditability, add "License Text:" lines pointing to the upstream LICENSE file and an "Upstream ref:" line with the exact commit SHA used.

Apply example edits:

```diff
-https://github.com/huggingface/pytorch-image-models
+License Text: https://github.com/huggingface/pytorch-image-models/blob/main/LICENSE
+Upstream ref: https://github.com/huggingface/pytorch-image-models/commit/<sha>

-https://github.com/google-research/vision_transformer
-https://github.com/google-research/big_vision
+License Text: https://github.com/google-research/vision_transformer/blob/main/LICENSE
+License Text: https://github.com/google-research/big_vision/blob/main/LICENSE
+Upstream ref: https://github.com/google-research/vision_transformer/commit/<sha>, https://github.com/google-research/big_vision/commit/<sha>

-https://github.com/AI-Lab-Makerere/ibean
+License Text: https://github.com/AI-Lab-Makerere/ibean/blob/master/LICENSE
+Upstream ref: https://github.com/AI-Lab-Makerere/ibean/commit/<sha>

-https://github.com/lucidrains/vit-pytorch
+License Text: https://github.com/lucidrains/vit-pytorch/blob/main/LICENSE
+Upstream ref: https://github.com/lucidrains/vit-pytorch/commit/<sha>

-https://github.com/karpathy/minGPT
+License Text: https://github.com/karpathy/minGPT/blob/master/LICENSE
+Upstream ref: https://github.com/karpathy/minGPT/commit/<sha>
```

Also applies to: 1079-1082, 1286-1290, 1311-1313, 1336-1338
874-1285: Reduce redundant Apache-2.0 full-text blocks.

Multiple full Apache texts bloat the file. Keep one canonical Apache-2.0 text and list all covered projects beneath it, unless legal requires embedding each upstream's exact block. If you keep duplicates, at least drop the Appendix sections to avoid placeholder confusion.
1056-1076: Appendix placeholders can confuse consumers.

These "[yyyy]/[name of copyright owner]" and "{}" placeholder appendices are boilerplate for authors, not third-party attribution. Trim them to prevent misinterpretation.
Also applies to: 1260-1284
874-1346: Add automated license scanning to CI for drift and completeness.

Integrate a lightweight step (e.g., scancode-toolkit, pip-licenses, or reuse) to ensure all transitive deps in recipes/vit/requirements.txt and Docker images are represented here, and to catch future drift.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
.github/workflows/unit-tests-recipes.yml(1 hunks)LICENSE/third_party.txt(1 hunks)recipes/vit/LICENSE(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- recipes/vit/LICENSE
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (3)
.github/workflows/unit-tests-recipes.yml (1)
134-134: Good call increasing /dev/shm for ViT/TE tests.

This should mitigate DataLoader multiprocessing and NCCL shared-mem pressure in containerized runs.
To be safe on self‑hosted GPU runners, confirm Docker honors this flag (some setups inject their own runopts). You can add a quick check step:
```diff
       - name: Show GPU info
         run: nvidia-smi
+      - name: Verify container /dev/shm size
+        run: |
+          echo "Expecting /dev/shm ~= 16G"
+          df -h /dev/shm || true
```

LICENSE/third_party.txt (2)
1336-1346: Keep minGPT license entry; attribution exists in recipes/vit/vit.py.
recipes/vit/vit.py contains "inspired by Andrej Karpathy's minGPT" (≈line 36), so retain the LICENSE/third_party.txt entry; add per-file header/NOTICE only if you substantially adapted code.
1286-1310: Clarify Beans dataset license — Hugging Face card is blank; iBean repo is MIT.
- Finding: Hugging Face dataset card for AI-Lab-Makerere/beans lists no license; the iBean GitHub repository LICENSE is MIT.
- Action (file: LICENSE/third_party.txt, lines 1286–1310): either replace/supplement the MIT block with the dataset's actual license or add an explicit note that the Hugging Face dataset has no license listed and the MIT text refers only to the iBean repo code. Confirm the dataset license with the dataset owner if unclear.
cf087af to
ef8f31d
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
recipes/vit/vit.py (2)
515-524: Bias concatenation order is wrong (breaks attention/MLP); also avoid registering None buffers.
`self.in_split` is `[mlp_hidden_dim, dim, dim, dim]`, but the bias is concatenated as `(qkv_bias, mlp_bias)`. This assigns the MLP bias to QKV and a zero bias to the MLP — functional bug. Also, `register_buffer(..., None)` / `register_parameter(..., None)` is unnecessary here. Apply this diff:
```diff
         self.in_split = [mlp_hidden_dim] + [dim] * 3
-        if qkv_bias:
-            self.register_buffer("qkv_bias", None)
-            self.register_parameter("mlp_bias", None)
+        if qkv_bias:
+            # Use in-proj's bias for both MLP and QKV when qkv_bias=True
+            self.qkv_bias = None
+            self.mlp_bias = None
         else:
             self.register_buffer("qkv_bias", torch.zeros(3 * dim), persistent=False)
             self.mlp_bias = nn.Parameter(torch.zeros(mlp_hidden_dim))
@@
         if self.mlp_bias is not None:
             # Concat constant zero-bias for qkv w/ trainable mlp_bias.
             # Appears faster than adding to x_mlp separately
-            y = F.linear(y, self.in_proj.weight, torch.cat((self.qkv_bias, self.mlp_bias)))
+            # Order must match self.in_split = [mlp, q, k, v]
+            y = F.linear(y, self.in_proj.weight, torch.cat((self.mlp_bias, self.qkv_bias)))
         else:
             y = self.in_proj(y)
```

Also applies to: 540-549, 542-547
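To see concretely why the concatenation must mirror the split order, here is a standalone sketch with toy tensors (not the recipe's module; shapes are illustrative) comparing the fused projection against an unfused reference for the MLP branch:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, mlp_hidden = 4, 8
x = torch.randn(2, dim)
weight = torch.randn(mlp_hidden + 3 * dim, dim)  # fused rows: [mlp | q | k | v]
mlp_bias = torch.randn(mlp_hidden)
qkv_bias = torch.zeros(3 * dim)

# The bias concat must mirror the fused weight's row order (mlp first),
# which is also the order of in_split = [mlp_hidden, dim, dim, dim].
y = F.linear(x, weight, torch.cat((mlp_bias, qkv_bias)))
x_mlp, q, k, v = torch.split(y, [mlp_hidden, dim, dim, dim], dim=-1)

# Reference: unfused MLP branch computed directly from the first rows.
ref = F.linear(x, weight[:mlp_hidden], mlp_bias)
```

Concatenating `(qkv_bias, mlp_bias)` instead would shift the trainable bias onto the q/k/v rows, which is exactly the bug flagged above.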
1593-1596: Guard pos_embed when disabled.
When `pos_embed=False`, `self.pos_embed` is `None`; the current calls will crash. Apply this diff:
```diff
-        x = self.pos_embed(x)
+        if self.pos_embed is not None:
+            x = self.pos_embed(x)
@@
-        x = self.pos_embed(x)
+        if self.pos_embed is not None:
+            x = self.pos_embed(x)
```

Also applies to: 1677-1680
🧹 Nitpick comments (2)
recipes/vit/vit.py (2)
132-138: Fail fast when TE is requested but not installed.
Currently it silently falls back to non‑TE. Raise a clear error.
Apply this diff:
```diff
     if cfg.model.transformer_engine and _TE_INSTALLED:
         assert device_mesh is not None, "[build_model] device_mesh is required when using TransformerEngine."
         vit_kwargs["block_fn"] = TransformerLayer
         vit_kwargs["micro_batch_size"] = cfg.dataset.train.batch_size
         vit_kwargs["tp_group"] = device_mesh["tp"].get_group()
         vit_kwargs["tp_size"] = device_mesh["tp"].size()
+    elif cfg.model.transformer_engine and not _TE_INSTALLED:
+        raise ImportError("TransformerEngine was requested (cfg.model.transformer_engine=True) but is not installed.")
```
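The fail-fast pattern in isolation, as a hedged sketch (the `resolve_block_fn` helper is hypothetical, not part of the recipe; only the guarded-import idiom is the point):

```python
try:
    from transformer_engine.pytorch import TransformerLayer  # optional dependency
    _TE_INSTALLED = True
except ImportError:
    TransformerLayer = None
    _TE_INSTALLED = False


def resolve_block_fn(use_transformer_engine: bool):
    """Pick the transformer block implementation, failing loudly when TE
    was requested but is missing instead of silently falling back."""
    if use_transformer_engine:
        if not _TE_INSTALLED:
            raise ImportError(
                "TransformerEngine was requested but is not installed."
            )
        return TransformerLayer
    return "default-block"
```

The user gets an actionable `ImportError` at model-build time rather than a silently different (and slower-to-debug) model.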
139-141: Avoid channels_last on meta.
Calling `.to(memory_format=...)` on meta parameters is risky. Gate it on `not meta_init` (see earlier diff).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- LICENSE/third_party.txt (1 hunks)
- recipes/vit/LICENSE (1 hunks)
- recipes/vit/beans.py (1 hunks)
- recipes/vit/vit.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- recipes/vit/beans.py
- recipes/vit/LICENSE
- LICENSE/third_party.txt
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: cspades
PR: NVIDIA/bionemo-framework#1103
File: LICENSE/third_party.txt:1079-1285
Timestamp: 2025-09-16T19:36:07.174Z
Learning: The recipes/vit/vit.py file already contains comprehensive attribution for Google ViT research code, including paper references, upstream repository links, and proper acknowledgments of the original authors.
📚 Learning: 2025-09-16T19:36:07.174Z
Learnt from: cspades
PR: NVIDIA/bionemo-framework#1103
File: LICENSE/third_party.txt:1079-1285
Timestamp: 2025-09-16T19:36:07.174Z
Learning: The recipes/vit/vit.py file already contains comprehensive attribution for Google ViT research code, including paper references, upstream repository links, and proper acknowledgments of the original authors.
Applied to files:
recipes/vit/vit.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (4)
recipes/vit/vit.py (4)
16-80: Attribution and licensing look correct.
Apache-2.0 and MIT attributions for timm, Phil Wang, and Karpathy are clearly included. Matches our third_party entry.
1510-1516: TE internal attribute access is brittle across versions.
`self_attention.proj.weight` and `layernorm_mlp.fc2_weight` may not exist in all TE releases. Guard with `hasattr` or gate behind a version check. Apply this diff:
```diff
-        if _TE_INSTALLED and self.block_fn == TransformerLayer:
-            rescale(layer.self_attention.proj.weight.data, layer_id + 1)
-            rescale(layer.layernorm_mlp.fc2_weight.data, layer_id + 1)
+        if _TE_INSTALLED and self.block_fn == TransformerLayer:
+            if hasattr(layer, "self_attention") and hasattr(layer.self_attention, "proj"):
+                rescale(layer.self_attention.proj.weight.data, layer_id + 1)
+            if hasattr(layer, "layernorm_mlp") and hasattr(layer.layernorm_mlp, "fc2_weight"):
+                rescale(layer.layernorm_mlp.fc2_weight.data, layer_id + 1)
```

If you want, I can script-check the installed TE version and symbols across CI runners.
106-112: Defensive TE import is fine.
Short-circuiting on `_TE_INSTALLED` prevents a `NameError` when `TransformerLayer` isn't present.
125-141: Meta init context is invalid; torch.device is not a context manager.
Using `with torch.device("meta")` will raise at runtime. Switch to `torch.set_default_device("meta")` with a try/finally, and avoid channels_last on meta. Apply this diff:
```diff
-    with (
-        # Meta Device Initialization
-        torch.device("meta") if meta_init else nullcontext()
-    ):
-        vit_kwargs = dict(cfg.model.vit)
-        if meta_init:
-            vit_kwargs["weight_init"] = None
-        if cfg.model.transformer_engine and _TE_INSTALLED:
-            assert device_mesh is not None, "[build_model] device_mesh is required when using TransformerEngine."
-            vit_kwargs["block_fn"] = TransformerLayer
-            vit_kwargs["micro_batch_size"] = cfg.dataset.train.batch_size
-            vit_kwargs["tp_group"] = device_mesh["tp"].get_group()
-            vit_kwargs["tp_size"] = device_mesh["tp"].size()
-        model = VisionTransformer(**vit_kwargs)
-        if cfg.model.channels_last:
-            model.to(memory_format=torch.channels_last)
+    if meta_init:
+        torch.set_default_device("meta")
+    try:
+        vit_kwargs = dict(cfg.model.vit)
+        if meta_init:
+            vit_kwargs["weight_init"] = None
+        if cfg.model.transformer_engine and _TE_INSTALLED:
+            assert device_mesh is not None, "[build_model] device_mesh is required when using TransformerEngine."
+            vit_kwargs["block_fn"] = TransformerLayer
+            vit_kwargs["micro_batch_size"] = cfg.dataset.train.batch_size
+            vit_kwargs["tp_group"] = device_mesh["tp"].get_group()
+            vit_kwargs["tp_size"] = device_mesh["tp"].size()
+        model = VisionTransformer(**vit_kwargs)
+        if cfg.model.channels_last and not meta_init:
+            model.to(memory_format=torch.channels_last)
+    finally:
+        if meta_init:
+            torch.set_default_device("cpu")
```

Likely an incorrect or invalid review comment.
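A quick runnable check consistent with that "likely incorrect" verdict (assuming PyTorch 2.0 or newer, where `torch.device` gained context-manager support):

```python
import torch

# In PyTorch >= 2.0, torch.device instances work as context managers:
# tensors constructed inside the block default to that device.
with torch.device("meta"):
    t = torch.empty(4, 4)

assert t.is_meta  # allocated on the meta device, no real storage
```

So the `with torch.device("meta") if meta_init else nullcontext()` pattern in the original code is valid on recent PyTorch; only the channels_last-on-meta concern may still apply.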
Signed-off-by: Cory Ye <cye@nvidia.com>
…cripts and tests using Torch.save checkpoint loading. Signed-off-by: Cory Ye <cye@nvidia.com>
ef8f31d to
260c5a5
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (4)
recipes/vit/README.md (3)
26-31: Replace PIP_CONSTRAINT tip with venv setup.
Constraints can’t be “emptied” via env var; recommend a clean venv instead.
````diff
-```bash
-cd recipes/vit
-# If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
-# We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
-pip install -r requirements.txt
-```
+```bash
+cd recipes/vit
+# Recommended: create and activate a clean virtual environment (or use the provided Docker image)
+python -m venv .venv && source .venv/bin/activate
+pip install -r requirements.txt
+```
````
41-41: Fix PyTorch DCP links to canonical domain.
Use pytorch.org, not docs.pytorch.org.
```diff
-which will train on a local tiny 5-class version of [ImageNet](https://image-net.org/) ([super-tiny-imagenet-5](./data/super-tiny-imagenet-5/)) and save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
+which will train on a local tiny 5-class version of [ImageNet](https://image-net.org/) ([super-tiny-imagenet-5](./data/super-tiny-imagenet-5/)) and save auto-resumable [PyTorch Distributed Checkpoint (DCP)](https://pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
@@
-  Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
+  Docs: https://pytorch.org/docs/stable/distributed.checkpoint.html
@@
-  Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
+  Docs: https://pytorch.org/docs/stable/distributed.checkpoint.html
```

Also applies to: 62-63, 76-77
55-57: Add missing torch import in the Python snippet.
Prevents NameError and makes the example runnable.
````diff
 ```python
+import torch
````

recipes/vit/checkpoint.py (1)
155-165: Avoid world‑writable 0777 on checkpoints; prefer 0o775 (or respect umask).
0777 is risky on shared filesystems. Tighten perms; keep the rank‑0 logging guard.
```diff
-    # Relax checkpoint permissions, which may be helpful when saving checkpoints in a container owned by root.
-    mode = 0o777
+    # Relax checkpoint permissions for collaboration; avoid world-writable.
+    mode = 0o775
     for dirpath, _, filenames in os.walk(ckpt_dir):
         # Change current directory perms.
         os.chmod(dirpath, mode)
         for filename in filenames:
             # Change file perms.
             file_path = Path(dirpath) / filename
             os.chmod(file_path, mode)
     if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
         _logger.info(f"Saved validated checkpoint to: {ckpt_dir}")
```
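The permission change is easy to sanity-check in isolation (a standalone sketch using a temp directory; `0o775` clears the world-write bit that `0o777` sets, since `os.chmod` applies the mode exactly, without umask interaction):

```python
import os
import stat
import tempfile

# 0o777 grants world write; 0o775 keeps owner/group rwx but drops it.
with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = os.path.join(tmp, "step_75_loss_1.725")
    os.mkdir(ckpt_dir)
    os.chmod(ckpt_dir, 0o775)
    mode = stat.S_IMODE(os.stat(ckpt_dir).st_mode)

assert mode == 0o775
assert not (mode & stat.S_IWOTH)  # no world-write bit
```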
🧹 Nitpick comments (11)
recipes/vit/beans.py (2)
49-56: Make sampler.set_epoch() optional to avoid AttributeError with non-Distributed samplers.
SequentialSampler and many custom samplers don’t implement set_epoch. Guard the call.
```diff
 def infinite_dataloader(dataloader, sampler):
     """Create an infinite iterator that automatically restarts at the end of each epoch."""
     epoch = 0
     while True:
-        sampler.set_epoch(epoch)  # Update epoch for proper shuffling
+        # Update epoch for proper shuffling when supported (e.g., DistributedSampler)
+        if sampler is not None and hasattr(sampler, "set_epoch"):
+            sampler.set_epoch(epoch)
         for batch in dataloader:
             yield batch
         epoch += 1  # Increment epoch counter after completing one full pass
```
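The guarded variant can be exercised without a `DistributedSampler` (standalone sketch; the plain-list "loader" is a stand-in for a real DataLoader, and `sampler=None` mimics a SequentialSampler with no `set_epoch`):

```python
def infinite_dataloader(dataloader, sampler=None):
    """Yield batches forever; only call set_epoch when the sampler supports it."""
    epoch = 0
    while True:
        if sampler is not None and hasattr(sampler, "set_epoch"):
            sampler.set_epoch(epoch)
        for batch in dataloader:
            yield batch
        epoch += 1  # completed one full pass


# A plain list of batches stands in for a DataLoader here.
batches = [[0, 1], [2, 3]]
it = infinite_dataloader(batches, sampler=None)
first_four = [next(it) for _ in range(4)]  # crosses the epoch boundary safely
```

With the unguarded original, the same call would raise `AttributeError` on the first epoch.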
40-44: Specify interpolation when resizing to avoid PIL default (nearest) degrading image quality.
PIL.Image.resize defaults to nearest-neighbor. Use bicubic/bilinear for classification inputs.
```diff
 import torch
 from datasets import load_dataset
 from torch.utils.data import Dataset
 from torchvision.transforms.functional import to_tensor
+from PIL import Image
@@
-        image_tensor = to_tensor(sample["image"].resize(self.resize_dimensions).convert("RGB"))
+        img = sample["image"].convert("RGB").resize(self.resize_dimensions, resample=Image.BICUBIC)
+        image_tensor = to_tensor(img)
         label_idx = sample["labels"]
         return image_tensor, label_idx
```

Also applies to: 85-87
recipes/vit/checkpoint.py (5)
35-36: Fix PyTorch docs URLs to canonical domain.
Use pytorch.org for tutorials/docs.
```diff
-    Docs: https://docs.pytorch.org/tutorials/beginner/saving_loading_models.html
+    Docs: https://pytorch.org/tutorials/beginner/saving_loading_models.html
@@
-    Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
+    Docs: https://pytorch.org/docs/stable/distributed.checkpoint.html
@@
-    Docs: https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
+    Docs: https://pytorch.org/docs/stable/distributed.checkpoint.html
```

Also applies to: 60-61, 123-124
45-51: Broaden the 'module.' prefix warning check and avoid StopIteration on empty state dict.
Check all keys from the raw checkpoint dict; don’t rely on the first key.
```diff
-    first_key = next(iter(model_checkpoint))
-    if first_key.startswith("module.") and not megatron_fsdp:
+    if (not megatron_fsdp) and any(k.startswith("module.") for k in checkpoint["model"].keys()):
         _logger.warning(
-            f"Checkpoint state dictionary keys ({first_key}) may be prefixed "
+            "Checkpoint state dictionary keys may be prefixed "
             "with 'module.' if converted from a Megatron-FSDP DCP checkpoint."
             "Set megatron_fsdp=True to automatically strip the prefix."
         )
```
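The behavioral difference between `next(iter(...))` and `any(...)` is easy to verify standalone (the `warn_if_module_prefixed` helper is hypothetical, distilled from the diff's condition):

```python
def warn_if_module_prefixed(state_dict, megatron_fsdp=False):
    """True when a warning should fire: keys carry the 'module.' prefix
    but megatron_fsdp prefix-stripping is disabled."""
    # any() is False on an empty dict (no StopIteration, unlike
    # next(iter(...))) and inspects every key, not just the first.
    return (not megatron_fsdp) and any(k.startswith("module.") for k in state_dict)
```

Note that `next(iter({}))` raises `StopIteration`, so the original code would crash outright on an empty state dict.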
57-73: Use DCP with stateful objects directly; avoid redundant state_dict round‑trip.
Passing the model/optimizer objects is the intended API and avoids extra memory/copies.
```diff
 def load_dcp_checkpoint(checkpoint_path, model=None, optimizer=None):
@@
-    # Load model and optimizer checkpoints.
-    state_dict = {}
-    if model is not None:
-        state_dict["model"] = model.state_dict()
-    if optimizer is not None:
-        state_dict["optimizer"] = optimizer.state_dict()
-    torch.distributed.checkpoint.load(state_dict, checkpoint_id=checkpoint_path)
-    if model is not None:
-        model.load_state_dict(state_dict["model"])
-    if optimizer is not None:
-        optimizer.load_state_dict(state_dict["optimizer"])
+    # Load model and optimizer checkpoints.
+    state = {}
+    if model is not None:
+        state["model"] = model
+    if optimizer is not None:
+        state["optimizer"] = optimizer
+    torch.distributed.checkpoint.load(state, checkpoint_id=checkpoint_path)
```
120-131: Mirror the stateful-objects approach for DCP save.
This aligns save/load symmetry and leverages DCP’s integration with stateful objects.
```diff
 def save_dcp_checkpoint(checkpoint_path, model=None, optimizer=None):
@@
-    # Save model and optimizer checkpoints.
-    state_dict = {}
-    if model is not None:
-        state_dict["model"] = model.state_dict()
-    if optimizer is not None:
-        state_dict["optimizer"] = optimizer.state_dict()
-    torch.distributed.checkpoint.save(state_dict, checkpoint_id=checkpoint_path)
+    # Save model and optimizer checkpoints.
+    state = {}
+    if model is not None:
+        state["model"] = model
+    if optimizer is not None:
+        state["optimizer"] = optimizer
+    torch.distributed.checkpoint.save(state, checkpoint_id=checkpoint_path)
```
99-111: Harden auto‑resume checkpoint selection against malformed names.
Defensively parse “step_loss” directory names; skip dirs that don’t match; only use loss sorting for resume_from_metric in {“+”, “-”}.
```diff
-    opt_metric_coeff = 1 if cfg.training.checkpoint.resume_from_metric == "+" else -1
-    latest_subdir = max(
-        subdirs,
-        key=lambda x: (
-            opt_metric_coeff * float(x.name.split("_")[3])
-            if cfg.training.checkpoint.resume_from_metric
-            else 0,
-            x.stat().st_mtime,
-        ),
-    )
+    mode = str(getattr(cfg.training.checkpoint, "resume_from_metric", "")).strip()
+    opt_metric_coeff = 1 if mode == "+" else -1
+    def _score(p: Path):
+        parts = p.name.split("_")
+        if mode in {"+", "-"} and len(parts) >= 4 and parts[0] == "step" and parts[2] == "loss":
+            try:
+                return (opt_metric_coeff * float(parts[3]), p.stat().st_mtime)
+            except ValueError:
+                pass
+        return (0, p.stat().st_mtime)
+    latest_subdir = max(subdirs, key=_score)
```

recipes/vit/README.md (4)
3-3: Fix TIMM casing and replace bare URL with a markdown link.
Minor polish; avoids MD034 and improves readability.
```diff
-_Adapted ViT model code from huggingface/pytorch-image-models (TImM) written by Ross Wightman (@rwightman / Copyright 2020), which you can check out here: https://github.com/huggingface/pytorch-image-models_
+_Adapted ViT model code from [huggingface/pytorch-image-models (timm)](https://github.com/huggingface/pytorch-image-models) written by Ross Wightman (@rwightman, © 2020)._
```
11-13: Add languages to fenced code blocks and note NGPU env.
Improves linting (MD040) and makes examples copy‑pasteable.
````diff
-```
+```bash
 docker build -t <image_repo>:<image_tag> .
@@
-```
+```bash
 # Utilize plenty of shared memory (--shm-size) to support loading large batches of image data!
 docker run -it --rm --gpus=all --shm-size=16G <image_repo>:<image_tag>
@@
-```
+```bash
 cd recipes/vit
 # If this causes problems, you can add PIP_CONSTRAINT= before the `pip install` command to ignore potentially trivial dependency conflicts.
 # We strongly recommend installing into a clean virtual environment or CUDA container, such as the image built from the Dockerfile in this recipe.
 pip install -r requirements.txt
@@
-```
+```bash
+# set NGPU to the number of visible GPUs, e.g.: export NGPU=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
 torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
@@
-```
+```bash
 python -m torch.distributed.checkpoint.format_utils --help
 usage: format_utils.py [-h] {torch_to_dcp,dcp_to_torch} src dst
@@
-```
+```bash
 python -m torch.distributed.checkpoint.format_utils dcp_to_torch step_75_loss_1.725 torch_ckpt_test.pt
````

Also applies to: 17-21, 26-31, 37-39, 105-117, 121-123

---

`5-9`: **Normalize heading levels (MD001)**

Increment by one level at a time for cleaner structure.

```diff
-### Pre-Requisites
+## Pre-Requisites
@@
-#### Docker Container
+### Docker Container
@@
-#### PIP Install
+### PIP Install
```
159-159: Replace bare URL with a labeled link.
Minor docs polish.
```diff
-TODO(@cspades): For converting DCP directly to HuggingFace SafeTensors checkpoints, you can look into: https://pytorch.org/blog/huggingface-safetensors-support-in-pytorch-distributed-checkpointing/
+TODO(@cspades): For converting DCP directly to HuggingFace SafeTensors checkpoints, see the PyTorch blog: [HuggingFace `safetensors` support in PyTorch Distributed Checkpointing](https://pytorch.org/blog/huggingface-safetensors-support-in-pytorch-distributed-checkpointing/).
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
- .github/workflows/unit-tests-recipes.yml (1 hunks)
- LICENSE/third_party.txt (1 hunks)
- recipes/vit/.gitignore (1 hunks)
- recipes/vit/.ruff.toml (1 hunks)
- recipes/vit/Dockerfile (1 hunks)
- recipes/vit/LICENSE (1 hunks)
- recipes/vit/README.md (1 hunks)
- recipes/vit/beans.py (1 hunks)
- recipes/vit/checkpoint.py (1 hunks)
- recipes/vit/config/defaults.yaml (1 hunks)
- recipes/vit/config/vit_base_patch16_224.yaml (1 hunks)
- recipes/vit/config/vit_te_base_patch16_224.yaml (1 hunks)
- recipes/vit/distributed.py (1 hunks)
- recipes/vit/infer.py (1 hunks)
- recipes/vit/requirements.txt (1 hunks)
- recipes/vit/test_infer.py (1 hunks)
- recipes/vit/test_train.py (1 hunks)
- recipes/vit/train.py (1 hunks)
- recipes/vit/vit.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- recipes/vit/.gitignore
🚧 Files skipped from review as they are similar to previous changes (15)
- .github/workflows/unit-tests-recipes.yml
- recipes/vit/config/vit_base_patch16_224.yaml
- recipes/vit/LICENSE
- recipes/vit/Dockerfile
- recipes/vit/config/vit_te_base_patch16_224.yaml
- recipes/vit/.ruff.toml
- recipes/vit/test_infer.py
- recipes/vit/test_train.py
- recipes/vit/infer.py
- recipes/vit/distributed.py
- recipes/vit/train.py
- recipes/vit/requirements.txt
- recipes/vit/config/defaults.yaml
- LICENSE/third_party.txt
- recipes/vit/vit.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: cspades
PR: NVIDIA/bionemo-framework#1103
File: LICENSE/third_party.txt:1079-1285
Timestamp: 2025-09-16T19:36:07.174Z
Learning: The recipes/vit/vit.py file already contains comprehensive attribution for Google ViT research code, including paper references, upstream repository links, and proper acknowledgments of the original authors.
📚 Learning: 2025-09-16T19:36:07.174Z
Learnt from: cspades
PR: NVIDIA/bionemo-framework#1103
File: LICENSE/third_party.txt:1079-1285
Timestamp: 2025-09-16T19:36:07.174Z
Learning: The recipes/vit/vit.py file already contains comprehensive attribution for Google ViT research code, including paper references, upstream repository links, and proper acknowledgments of the original authors.
Applied to files:
recipes/vit/README.md
📚 Learning: 2025-09-09T23:57:34.120Z
Learnt from: pstjohn
PR: NVIDIA/bionemo-framework#1139
File: models/geneformer/src/geneformer/convert.py:1-3
Timestamp: 2025-09-09T23:57:34.120Z
Learning: NVIDIA's legal team has specified using "LicenseRef-Apache2" as the SPDX license identifier instead of the standard "Apache-2.0" identifier in their codebase header files.
Applied to files:
recipes/vit/checkpoint.py
🧬 Code graph analysis (1)
recipes/vit/beans.py (1)
sub-packages/bionemo-size-aware-batching/tests/bionemo/size_aware_batching/conftest.py (1)
sampler(92-93)
🪛 markdownlint-cli2 (0.17.2)
recipes/vit/README.md
3-3: Bare URL used
(MD034, no-bare-urls)
5-5: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
105-105: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
121-121: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
159-159: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (1)
recipes/vit/checkpoint.py (1)
2-2: SPDX identifier is correct for this repo.
LicenseRef-Apache2 is the expected identifier per NVIDIA policy; keep as-is.
Description
TODO
Usage
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:
Note
By default, the notebooks validation tests are skipped unless explicitly enabled.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.
automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
Comment `/ok to test` on the pull request to trigger CI. This will need to be done for each new commit.

Pre-submit Checklist
Summary by CodeRabbit
New Features
Documentation
Tests
Chores