Add FP4/FP8 weight quantization for Blackwell/Hopper GPU inference#516
2imi9 wants to merge 6 commits into allenai:main
Conversation
nvidia-modelopt-based weight quantization module and benchmark scripts for OlmoEarth ViT models. Supports FP4 (Blackwell) and FP8 (Hopper+).
Results on OlmoEarth-v1-Nano (1.36M params, RTX 5090):
EuroSAT KNN classification (real Sentinel-2 imagery, 27K samples):
FP32: 65.6% accuracy (baseline)
FP8: 65.4% accuracy (-0.2%)
FP4: 63.6% accuracy (-2.0%)
Embedding cosine similarity vs FP32:
FP8: 0.999 mean | FP4: 0.980 mean
Quantized models: huggingface.co/2imi9/olmoearth-nano-fp8, olmoearth-nano-fp4
W&B: wandb.ai/2imi9-northeastern-university/OlmoEarth_Q
Files:
- olmoearth_pretrain/quantization.py: reusable quantization module
- scripts/nvfp4_quantization.py: quantization + cosine similarity pipeline
- scripts/eval_quantization.py: real-data KNN evaluation on EuroSAT
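As a reference for the embedding cosine similarity metric reported above, a minimal sketch of how it could be computed (names are illustrative, not the PR's actual code):

```python
import torch
import torch.nn.functional as F

def mean_cosine_similarity(fp32_emb: torch.Tensor, quant_emb: torch.Tensor) -> float:
    """Mean per-sample cosine similarity between FP32 and quantized
    embeddings of shape (num_samples, embed_dim) from identical inputs."""
    return F.cosine_similarity(fp32_emb, quant_emb, dim=1).mean().item()
```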
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e16edacdde
```python
fp32_model.eval()
# Deep copy for FP4 so we keep the original FP32 model
fp4_model = copy.deepcopy(fp32_model)
fp4_model = step2_quantize(fp4_model, args.quant_config, precision=args.precision)
```
Gate FP4/FP8 comparisons on quantization success
step2_quantize() returns the input model unchanged when quantization is skipped or fails, and the caller still stores that object in fp4_model after a deep copy. Later steps then benchmark and compare an unquantized FP32 clone as if it were FP4/FP8, which can silently invalidate the reported quality and throughput results in environments without ModelOpt/CUDA/nvcc.
Fixed in b0d56a7 — now checks count_quantizer_nodes() after quantization and falls back to FP32 if no nodes were inserted.
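A minimal sketch of that guard, assuming the count_quantizer_nodes() helper from this PR's quantization module (the surrounding call sites are illustrative):

```python
fp4_model = step2_quantize(fp4_model, args.quant_config, precision=args.precision)

# step2_quantize() returns the model unchanged when ModelOpt/CUDA is missing,
# so verify that quantizer nodes were actually inserted before benchmarking.
if count_quantizer_nodes(fp4_model) == 0:
    print("No quantizer nodes inserted; falling back to the FP32 baseline.")
    fp4_model = fp32_model  # avoid reporting an FP32 clone as FP4/FP8
```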
```python
sim = test_emb @ train_emb.t()

# Top-k
topk_sim, topk_idx = sim.topk(k, dim=1)
```
Bound KNN top-k by training set size
The KNN path always executes sim.topk(k, dim=1) with k=20, so runs with fewer than 20 training embeddings (for example --max-train 8) crash with an out-of-range error instead of producing metrics. Using min(k, n_train) avoids this hard failure for smaller/debug subsets.
Fixed in b0d56a7 — added k = min(k, len(train_emb)).
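In context, the bounded top-k looks roughly like this (variable names follow the diff above):

```python
# Cosine similarities between test and train embeddings.
sim = test_emb @ train_emb.t()

# Bound k by the number of training embeddings so small/debug subsets
# (e.g. --max-train 8) don't trigger an out-of-range topk.
k = min(k, len(train_emb))
topk_sim, topk_idx = sim.topk(k, dim=1)
```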
- Check quantizer node count after step2_quantize to detect failed quantization instead of silently benchmarking an FP32 clone as FP4
- Bound KNN k by training set size to prevent crash with small subsets
Complementary to #477 — that PR quantizes output embeddings for storage, this PR quantizes model weights for inference.
Note: Both FP8 and FP4 use simulated quantization (real precision loss, FP32 compute). Native inference speedup requires TensorRT export, currently blocked by FlexiViT's dynamic shapes. The quantization module and accuracy results are ready for when export support matures.
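To make "simulated quantization" concrete: a minimal fake-quantization sketch using a symmetric integer grid as a stand-in for the FP8/FP4 formats (this shows the general quantize-dequantize idea, not ModelOpt's actual implementation):

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Round weights onto a low-precision grid, then return them in FP32.

    The result carries the precision loss of the low-bit format, but all
    subsequent matmuls still run in FP32: no speedup, plus extra
    quantize-dequantize overhead at inference time.
    """
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8-bit
    scale = w.abs().max() / qmax              # per-tensor symmetric scale
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q * scale                          # dequantize back to FP32
```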
I tried TensorRT export with both dynamo and TorchScript; both fail due to FlexiViT's dynamic shapes. Would a static-shape export path be worth exploring, or is the accuracy validation sufficient for now?
Tests for count_quantizable_layers, get_model_memory_mb, count_quantizer_nodes, _get_quant_config, and availability checks. 17 tests covering all public functions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
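A unit test in this style might look like the following (the exact semantics of get_model_memory_mb are assumed: FP32 parameter bytes reported in MB):

```python
import torch
from olmoearth_pretrain.quantization import get_model_memory_mb

def test_get_model_memory_mb_matches_param_count():
    # 256*256 FP32 weights -> 256 * 256 * 4 bytes = 0.25 MB.
    model = torch.nn.Linear(256, 256, bias=False)
    expected_mb = 256 * 256 * 4 / (1024 ** 2)
    assert abs(get_model_memory_mb(model) - expected_mb) < 1e-3
```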
Hi, thanks for your interest in OlmoEarth! Based on the accuracy numbers you report, it seems like you are loading random weights; ViT-B is expected to get ~94-95% on KNN. Also, it would be great if you could share the latency under different settings for the quantization, and screenshots of W&B, as we don't have access to the link you shared.
…retrained normalization
Thanks for catching that — you were right. Two issues: I tested on Nano instead of Base, and my EuroSAT loader used RGB zero-padded to 12 bands instead of real Sentinel-2 data. Fixed in f617aa1 — the loader now reads EuroSAT multispectral .tif files (13 Sentinel-2 bands), maps them to OlmoEarth's 12-band order, and normalizes with the pretraining statistics. Updated results on OlmoEarth-v1-Base (RTX 5090) are in the screenshots below.
Latency is higher for the quantized models because they are fake-quantized (simulated precision loss, FP32 compute). Native speedup requires TensorRT export.
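A rough sketch of that loader fix (the band index mapping below is a hypothetical placeholder, not the actual mapping in scripts/eval_quantization.py):

```python
import numpy as np
import rasterio

# HYPOTHETICAL mapping from EuroSAT's 13 Sentinel-2 bands to OlmoEarth's
# 12-band order; the real indices live in the PR's eval script.
OLMOEARTH_BAND_IDX = [1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 0, 9]

def load_eurosat_tile(path: str, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Read a multispectral EuroSAT .tif, reorder to OlmoEarth's 12 bands,
    and normalize with the per-band pretraining statistics."""
    with rasterio.open(path) as src:
        bands = src.read().astype(np.float32)      # (13, H, W)
    bands = bands[OLMOEARTH_BAND_IDX]              # (12, H, W)
    return (bands - mean[:, None, None]) / std[:, None, None]
```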
StaticOlmoEarthEncoder wraps the FlexiViT encoder with fixed shapes, enabling torch.export() and TensorRT compilation. References the same trained weights — no copying or retraining.

Results on OlmoEarth-v1-Base (RTX 5090, bs=4):
PyTorch eager FP32: 166.9ms (1.0x)
TensorRT FP16: 34.6ms (4.8x), cosine sim 0.999999

Files:
- olmoearth_pretrain/export.py: StaticOlmoEarthEncoder + export pipeline
- scripts/benchmark_trt_export.py: TRT benchmark script
- tests/unit/test_export.py: 11 unit tests (all passing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
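For context, the general shape of such a static wrapper (a sketch with assumed names, not the PR's actual export.py):

```python
import torch

class StaticEncoderWrapper(torch.nn.Module):
    """Pin every input dimension of a dynamic-shape encoder so the graph
    becomes traceable by torch.export and compilable by TensorRT."""

    def __init__(self, encoder: torch.nn.Module, shape: tuple[int, ...]):
        super().__init__()
        self.encoder = encoder  # references the trained weights, no copy
        self.shape = shape      # fixed input shape, e.g. (B, C, T, H, W)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        assert tuple(x.shape) == self.shape, f"expected {self.shape}, got {tuple(x.shape)}"
        return self.encoder(x)

# Usage sketch: export with fully static shapes (no dynamic_shapes argument).
# wrapper = StaticEncoderWrapper(encoder, (4, 12, 1, 224, 224))
# exported = torch.export.export(wrapper, (torch.randn(*wrapper.shape),))
```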
This reverts commit 84efff3.
I also tried to solve the TensorRT export issue, in #520: a static-shape wrapper, giving a 4.8x speedup with TRT FP16.
…ount

- benchmark_trt_export.py: graceful fallback when the quantization module is not installed (it lives in PR allenai#516)
- export.py: verify_export uses num_timesteps instead of hardcoded T=1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


nvidia-modelopt weight quantization module and benchmark scripts for OlmoEarth ViT. Supports FP4 (Blackwell) and FP8 (Hopper+).
Complementary to #477 — that PR quantizes output embeddings for storage, this PR quantizes model weights for inference.
Results on OlmoEarth-v1-Base (86M params, RTX 5090):
EuroSAT KNN classification (Sentinel-2 multispectral, 27K samples, pretrained normalization):
Note: FP8/FP4 use simulated quantization (real precision loss, FP32 compute). Latency is higher due to quantize-dequantize overhead. Native inference speedup requires TensorRT export, blocked by FlexiViT's dynamic shapes.
Files:
- olmoearth_pretrain/quantization.py: reusable quantization module
- scripts/nvfp4_quantization.py: quantization + cosine similarity pipeline
- scripts/eval_quantization.py: EuroSAT multispectral KNN evaluation