Cross-posted here: numz/ComfyUI-SeedVR2_VideoUpscaler#528
# Quality Issues with SeedVR2 3B on 8GB VRAM GPU — Seeking Guidance
## Environment
| Component | Details |
|---|---|
| GPU | NVIDIA GeForce RTX 5070 Laptop GPU, 8GB VRAM |
| CPU | AMD Ryzen 9 9950X |
| PyTorch | 2.11.0.dev+cu128 |
| CUDA | 12.8 |
| Python | 3.12 |
| Model | SeedVR2 3B (GGUF Q8_0, 3,491 MB) |
| VAE | ema_vae (478 MB) |
| OS | Windows 11 |
## Context
I'm integrating SeedVR2 3B into a standalone desktop application (not ComfyUI) for video restoration from analog tape sources (VHS, Hi8, DV). The implementation uses:
- Block-by-block CPU↔GPU offloading for the DiT (only one transformer block on GPU at a time, FP8 compressed during transfer)
- Tiled VAE encode/decode (512×512 pixel tiles, 160px overlap, cosine blending) since the full frame won't fit in 8GB
- GGUF Q8_0 weights for the DiT (425 tensors FP16 + 210 tensors Q8_0)
- 1-step distilled inference (Euler sampler)
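For reference, the block-by-block offload in the first bullet amounts to something like the following sketch (function and variable names are illustrative, not my actual implementation, and the FP8 transfer compression is omitted):

```python
import torch
import torch.nn as nn

def offloaded_forward(blocks, x, device="cuda"):
    """Run a stack of transformer blocks with only one resident on the GPU.

    `blocks` is a list of nn.Module transformer blocks kept on the CPU; each
    is moved to `device` just for its forward pass, then moved back so that
    at most one block's weights occupy VRAM at a time.
    """
    for block in blocks:
        block.to(device, non_blocking=True)  # CPU -> GPU
        x = block(x)
        block.to("cpu")                      # GPU -> CPU, frees VRAM
    return x
```

The trade-off is obvious: every forward pass pays 32 host-to-device transfers, which is where the FP8 compression of the CPU copies helps.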
## Problem Description
The output has significantly lower quality than official demos and ComfyUI results on high-VRAM GPUs. Specifically:
- Dark patches / artifacts (partially fixed — see below)
- Waxy / blurry appearance (partially fixed — see below)
- Visible lines within VAE tiles (partially fixed — see below)
- Overall softness / lack of detail compared to official demos
## What I've Already Investigated and Fixed
I did a detailed diff between the official ByteDance SeedVR repo code and my implementation. Three issues were found and corrected:
### Fix 1: VAE dtype — float16 → bfloat16
- Problem: The VAE was running in `float16` (max 65,504). The 3D causal conv intermediate activations exceeded this range → overflow → dark artifacts.
- Fix: Changed to `bfloat16` (matches the official `configs_3b/main.yaml`: `vae.dtype: bfloat16`).
- Result: Dark patches eliminated.
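The overflow behind Fix 1 is easy to demonstrate in isolation:

```python
import torch

# float16 saturates at 65,504; bfloat16 keeps float32's exponent range,
# trading mantissa precision for headroom.
act = torch.tensor([70000.0])  # a plausible out-of-range conv activation

print(torch.isinf(act.to(torch.float16)).item())     # True: overflowed to inf
print(torch.isfinite(act.to(torch.bfloat16)).item()) # True: still finite
```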
### Fix 2: Color correction — LAB → wavelet only
- Problem: Post-processing applied `wavelet_reconstruction()` followed by LAB histogram matching. The LAB step over-corrected chrominance, producing a waxy/washed-out look.
- Fix: Changed to wavelet-only color correction (matches the official pipeline).
- Result: Waxy appearance reduced.
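The idea behind the wavelet-only correction (keep the restored frame's detail, borrow low-frequency color from the source) can be sketched roughly as follows; the box-blur low-pass and all names here are illustrative stand-ins for the official `wavelet_reconstruction()` helper, not its actual code:

```python
import torch
import torch.nn.functional as F

def _low_pass(x, k=5, n=5):
    """Cheap repeated box blur as a stand-in for the wavelet's blur pyramid."""
    pad = k // 2
    for _ in range(n):
        x = F.avg_pool2d(F.pad(x, (pad,) * 4, mode="reflect"), k, stride=1)
    return x

def wavelet_color_fix(restored, source):
    """Keep the restored frame's high frequencies (detail) but take the
    low-frequency color/luma from the source frame. (N, C, H, W) tensors."""
    detail = restored - _low_pass(restored)
    return detail + _low_pass(source)
```

Because only low frequencies are transplanted, this corrects color drift without the chrominance over-correction that LAB histogram matching introduced.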
### Fix 3: VAE conv spatial splitting — disabled
- Problem: `VAE_CONV_MAX_MEM` was set to 0.125 GiB (vs. the official 0.5 GiB), forcing `InflatedCausalConv3d.memory_limit_conv()` to split conv3d inputs along the H/W dimensions. This truncates the receptive field at split boundaries → visible horizontal/vertical lines within tiles.
- Fix: Set `VAE_CONV_MAX_MEM = float("inf")` to disable spatial splitting. With 512px tiles, peak per-tile VRAM is ~200–300 MB, well within budget.
- Result: Internal tile lines eliminated.
## What Still Doesn't Match Official Quality
Even after these three fixes, the output is noticeably softer than official demos. The image content is recognizable and the worst artifacts are gone, but fine detail and sharpness are lacking.
## Suspected Remaining Issues
### 1. Tiled VAE encode uses mode() instead of sample()
- The official code uses `.latent_dist.sample()` (stochastic sampling from the posterior)
- My tiled approach uses `.latent_dist.mode()` (deterministic mean)
- Reason: Stochastic sampling per-tile would create noise inconsistencies at tile boundaries
- Question: Is there a better approach for tiled VAE encoding that preserves stochastic behavior while maintaining tile boundary consistency?
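One option I'd like feedback on: draw a single frame-sized noise tensor once, then slice it per tile, so overlapping regions reuse identical epsilon and the reparameterized samples stay consistent where tiles blend. Purely a sketch of the idea; none of these names are SeedVR2 API:

```python
import torch

def sample_tile_consistent(mean, logvar, global_noise, y0, x0):
    """Reparameterized sampling z = mu + sigma * eps, where eps is a slice of
    one frame-wide noise tensor instead of fresh per-tile noise. Overlapping
    tiles then share the same eps in their overlap, so blended latents agree.
    `mean`/`logvar` are the posterior stats for one tile; (y0, x0) is the
    tile's top-left corner in latent coordinates.
    """
    _, _, h, w = mean.shape
    eps = global_noise[:, :, y0:y0 + h, x0:x0 + w]
    return mean + torch.exp(0.5 * logvar) * eps
```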
### 2. 512px tiles may be too small
- Official SeedVR2 runs on 80GB H100s with no tiling at all
- My 512px tiles with 160px overlap may lose too much global context
- ComfyUI uses 736px tiles on 16GB GPUs
- Question: What is the minimum tile size that preserves quality? Would 640px or 768px tiles work better, or is the quality loss inherent to tiling?
### 3. GGUF Q8_0 quantization impact
- Weight comparison shows max error 0.015625, mean error 0.000135 vs FP16
- A/B test showed PSNR 20.90 dB between Q8_0 and FP16 outputs
- Q8_0 output actually looked slightly better in our tests
- Question: Is Q8_0 expected to have meaningful quality loss vs FP16 for SeedVR2 3B?
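For context on the error numbers above: GGML-style Q8_0 stores one scale per 32-weight block plus int8 codes, so the worst-case error per weight is about half a quantization step. A simplified roundtrip simulation (a model of the format, not the actual GGUF reader):

```python
import numpy as np

def q8_0_roundtrip(w, block=32):
    """Quantize to per-block int8 with an absmax/127 scale, then dequantize.
    Simplified model of GGML Q8_0 (32 weights per block, one scale each)."""
    w = np.asarray(w, dtype=np.float32).reshape(-1, block)
    d = np.abs(w).max(axis=1, keepdims=True) / 127.0
    d = np.where(d == 0, 1.0, d)          # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(w / d), -127, 127)
    return (q * d).reshape(-1)

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
err = np.abs(w - q8_0_roundtrip(w))
print(err.max(), err.mean())  # bounded by half a per-block quantization step
```

The measured max error of 0.015625 is consistent with this kind of per-block rounding, which is why I'm surprised the PSNR between Q8_0 and FP16 outputs is only ~21 dB.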
### 4. Block-by-block DiT offloading
- Each of the 32 transformer blocks is loaded from CPU → GPU → inference → back to CPU
- FP8 compression during CPU storage (halves transfer size)
- Question: Does processing blocks individually (vs keeping the full model on GPU) affect attention quality? Each block still sees the full sequence, but are there cross-block dependencies that require simultaneous residency?
### 5. VAE tiling overlap/blending
- Using cosine blending in the overlap region (160px)
- Question: Is cosine blending optimal, or does the official codebase use a different blending strategy? Would Gaussian blending or feathering produce better results?
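For reference, the raised-cosine ramp I'm using has the property that opposite-facing weights sum to exactly 1 across the overlap, so blended tiles preserve overall intensity; a minimal sketch:

```python
import torch

def cosine_ramp(overlap):
    """Raised-cosine blend weights over a 1-D overlap: rises smoothly 0 -> 1,
    and w[i] + w[overlap - 1 - i] == 1, so two overlapping tiles always sum
    to unit weight (a partition of unity)."""
    t = torch.linspace(0, 1, overlap)
    return 0.5 - 0.5 * torch.cos(torch.pi * t)
```

Linear feathering has the same partition-of-unity property; unnormalized Gaussian weights generally do not, which is part of why I'm unsure a Gaussian blend would actually help.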
## Specific Questions for the Community
1. Has anyone achieved official-demo-quality results on an 8GB GPU? If so, what settings/compromises were used?
2. Is there a known minimum VRAM threshold below which SeedVR2 3B cannot produce good results regardless of tiling/offloading strategies?
3. Are there any other pipeline differences between the official ByteDance inference code and typical 8GB GPU implementations that I might have missed?
4. Would SeedVR2 7B with more aggressive quantization (e.g., Q4) produce better results than 3B at Q8 on 8GB VRAM?
## Reproduction
Single-frame test at 720p output (SD input → 720p upscale):
- Input: 480×360 DV frame
- Output target: 960×720 (padded to divisible-by-8)
- VAE tiles: 512×512, overlap 160px
- DiT: Full latent, block-by-block offload, FP8 mode
- Diffusion: 1 step, Euler sampler
- Color correction: wavelet only
- Processing time: ~35 seconds per frame
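The settings above, collected as one plain dict for easy copy/paste when reproducing (keys are illustrative, not a real SeedVR2 config schema):

```python
# Single-frame reproduction settings; key names are made up for readability.
REPRO_CONFIG = {
    "input_size": (480, 360),         # DV frame (W, H)
    "output_size": (960, 720),        # padded to divisible-by-8
    "vae_tile": 512,                  # square VAE tile edge, pixels
    "vae_tile_overlap": 160,          # pixels, cosine-blended
    "vae_dtype": "bfloat16",          # per Fix 1
    "dit_offload": "block_by_block",  # FP8-compressed CPU storage
    "steps": 1,                       # distilled 1-step inference
    "sampler": "euler",
    "color_correction": "wavelet",    # per Fix 2
}
```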
## References
- Official repo: ByteDance-Seed/SeedVR
- ComfyUI node: numz/ComfyUI-SeedVR2_VideoUpscaler
- Related issues: #159 (TRT performance), #394 (DiT tiling)
Any guidance or insights from folks who have worked on low-VRAM SeedVR2 implementations would be greatly appreciated.