Cross-posted here: numz/ComfyUI-SeedVR2_VideoUpscaler#528
# Quality Issues with SeedVR2 3B on 8GB VRAM GPU — Seeking Guidance
## Environment
| Component | Details |
|---|---|
| GPU | NVIDIA GeForce RTX 5070 Laptop GPU, 8GB VRAM |
| CPU | AMD Ryzen 9 9950X |
| PyTorch | 2.11.0.dev+cu128 |
| CUDA | 12.8 |
| Python | 3.12 |
| Model | SeedVR2 3B (GGUF Q8_0, 3,491 MB) |
| VAE | ema_vae (478 MB) |
| OS | Windows 11 |
## Context
I'm integrating SeedVR2 3B into a standalone desktop application (not ComfyUI) for video restoration from analog tape sources (VHS, Hi8, DV). The implementation uses:
- Block-by-block CPU↔GPU offloading for the DiT (only one transformer block on GPU at a time, FP8 compressed during transfer)
- Tiled VAE encode/decode (512×512 pixel tiles, 160px overlap, cosine blending) since the full frame won't fit in 8GB
- GGUF Q8_0 weights for the DiT (425 tensors FP16 + 210 tensors Q8_0)
- 1-step distilled inference (Euler sampler)
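For reference, the block-by-block offload in the first bullet amounts to something like the following sketch (function and variable names are illustrative, not my actual implementation, and the FP8 transfer compression is omitted):

```python
import torch
import torch.nn as nn

def offloaded_forward(blocks, x, device="cuda"):
    """Run a stack of transformer blocks with only one resident on the GPU.

    `blocks` is a list of nn.Module transformer blocks kept on the CPU; each
    is moved to `device` just for its forward pass, then moved back so that
    at most one block's weights occupy VRAM at a time.
    """
    for block in blocks:
        block.to(device, non_blocking=True)  # CPU -> GPU
        x = block(x)
        block.to("cpu")                      # GPU -> CPU, frees VRAM
    return x
```

The trade-off is obvious: every forward pass pays 32 host-to-device transfers, which is where the FP8 compression of the CPU copies helps.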
## Problem Description
The output has significantly lower quality than official demos and ComfyUI results on high-VRAM GPUs. Specifically:
- Dark patches / artifacts (partially fixed — see below)
- Waxy / blurry appearance (partially fixed — see below)
- Visible lines within VAE tiles (partially fixed — see below)
- Overall softness / lack of detail compared to official demos
## What I've Already Investigated and Fixed
I did a detailed diff between the official ByteDance SeedVR repo code and my implementation. Three issues were found and corrected:
### Fix 1: VAE dtype — float16 → bfloat16
- Problem: The VAE was running in `float16` (max 65,504). The 3D causal conv intermediate activations exceeded this range → overflow → dark artifacts.
- Fix: Changed to `bfloat16` (matches the official `configs_3b/main.yaml`: `vae.dtype: bfloat16`).
- Result: Dark patches eliminated.
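The overflow behind Fix 1 is easy to demonstrate in isolation:

```python
import torch

# float16 saturates at 65,504; bfloat16 keeps float32's exponent range,
# trading mantissa precision for headroom.
act = torch.tensor([70000.0])  # a plausible out-of-range conv activation

print(torch.isinf(act.to(torch.float16)).item())     # True: overflowed to inf
print(torch.isfinite(act.to(torch.bfloat16)).item()) # True: still finite
```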
### Fix 2: Color correction — LAB → wavelet only
- Problem: Post-processing applied `wavelet_reconstruction()` followed by LAB histogram matching. The LAB step over-corrected chrominance, producing a waxy/washed-out look.
- Fix: Changed to wavelet-only color correction (matches the official pipeline).
- Result: Waxy appearance reduced.
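The idea behind the wavelet-only correction (keep the restored frame's detail, borrow low-frequency color from the source) can be sketched roughly as follows; the box-blur low-pass and all names here are illustrative stand-ins for the official `wavelet_reconstruction()` helper, not its actual code:

```python
import torch
import torch.nn.functional as F

def _low_pass(x, k=5, n=5):
    """Cheap repeated box blur as a stand-in for the wavelet's blur pyramid."""
    pad = k // 2
    for _ in range(n):
        x = F.avg_pool2d(F.pad(x, (pad,) * 4, mode="reflect"), k, stride=1)
    return x

def wavelet_color_fix(restored, source):
    """Keep the restored frame's high frequencies (detail) but take the
    low-frequency color/luma from the source frame. (N, C, H, W) tensors."""
    detail = restored - _low_pass(restored)
    return detail + _low_pass(source)
```

Because only low frequencies are transplanted, this corrects color drift without the chrominance over-correction that LAB histogram matching introduced.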
### Fix 3: VAE conv spatial splitting — disabled
- Problem: `VAE_CONV_MAX_MEM` was set to 0.125 GiB (vs. the official 0.5 GiB), forcing `InflatedCausalConv3d.memory_limit_conv()` to split conv3d inputs along the H/W dimensions. This truncates the receptive field at split boundaries → visible horizontal/vertical lines within tiles.
- Fix: Set `VAE_CONV_MAX_MEM = float("inf")` to disable spatial splitting. With 512px tiles, peak per-tile VRAM is ~200–300 MB, well within budget.
- Result: Internal tile lines eliminated.
## What Still Doesn't Match Official Quality
Even after these three fixes, the output is noticeably softer than official demos. The image content is recognizable and the worst artifacts are gone, but fine detail and sharpness are lacking.
## Suspected Remaining Issues
### 1. Tiled VAE encode uses mode() instead of sample()
- The official code uses `.latent_dist.sample()` (stochastic sampling from the posterior)
- My tiled approach uses `.latent_dist.mode()` (deterministic mean)
- Reason: Stochastic sampling per-tile would create noise inconsistencies at tile boundaries
- Question: Is there a better approach for tiled VAE encoding that preserves stochastic behavior while maintaining tile boundary consistency?
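One option I'd like feedback on: draw a single frame-sized noise tensor once, then slice it per tile, so overlapping regions reuse identical epsilon and the reparameterized samples stay consistent where tiles blend. Purely a sketch of the idea; none of these names are SeedVR2 API:

```python
import torch

def sample_tile_consistent(mean, logvar, global_noise, y0, x0):
    """Reparameterized sampling z = mu + sigma * eps, where eps is a slice of
    one frame-wide noise tensor instead of fresh per-tile noise. Overlapping
    tiles then share the same eps in their overlap, so blended latents agree.
    `mean`/`logvar` are the posterior stats for one tile; (y0, x0) is the
    tile's top-left corner in latent coordinates.
    """
    _, _, h, w = mean.shape
    eps = global_noise[:, :, y0:y0 + h, x0:x0 + w]
    return mean + torch.exp(0.5 * logvar) * eps
```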
### 2. 512px tiles may be too small
- Official SeedVR2 runs on 80GB H100s with no tiling at all
- My 512px tiles with 160px overlap may lose too much global context
- ComfyUI uses 736px tiles on 16GB GPUs
- Question: What is the minimum tile size that preserves quality? Would 640px or 768px tiles work better, or is the quality loss inherent to tiling?
### 3. GGUF Q8_0 quantization impact
- Weight comparison shows max error 0.015625, mean error 0.000135 vs FP16
- A/B test showed PSNR 20.90 dB between Q8_0 and FP16 outputs
- Q8_0 output actually looked slightly better in our tests
- Question: Is Q8_0 expected to have meaningful quality loss vs FP16 for SeedVR2 3B?
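For context on the error numbers above: GGML-style Q8_0 stores one scale per 32-weight block plus int8 codes, so the worst-case error per weight is about half a quantization step. A simplified roundtrip simulation (a model of the format, not the actual GGUF reader):

```python
import numpy as np

def q8_0_roundtrip(w, block=32):
    """Quantize to per-block int8 with an absmax/127 scale, then dequantize.
    Simplified model of GGML Q8_0 (32 weights per block, one scale each)."""
    w = np.asarray(w, dtype=np.float32).reshape(-1, block)
    d = np.abs(w).max(axis=1, keepdims=True) / 127.0
    d = np.where(d == 0, 1.0, d)          # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(w / d), -127, 127)
    return (q * d).reshape(-1)

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
err = np.abs(w - q8_0_roundtrip(w))
print(err.max(), err.mean())  # bounded by half a per-block quantization step
```

The measured max error of 0.015625 is consistent with this kind of per-block rounding, which is why I'm surprised the PSNR between Q8_0 and FP16 outputs is only ~21 dB.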
### 4. Block-by-block DiT offloading
- Each of the 32 transformer blocks is loaded from CPU → GPU → inference → back to CPU
- FP8 compression during CPU storage (halves transfer size)
- Question: Does processing blocks individually (vs keeping the full model on GPU) affect attention quality? Each block still sees the full sequence, but are there cross-block dependencies that require simultaneous residency?
### 5. VAE tiling overlap/blending
- Using cosine blending in the overlap region (160px)
- Question: Is cosine blending optimal, or does the official codebase use a different blending strategy? Would Gaussian blending or feathering produce better results?
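For reference, the raised-cosine ramp I'm using has the property that opposite-facing weights sum to exactly 1 across the overlap, so blended tiles preserve overall intensity; a minimal sketch:

```python
import torch

def cosine_ramp(overlap):
    """Raised-cosine blend weights over a 1-D overlap: rises smoothly 0 -> 1,
    and w[i] + w[overlap - 1 - i] == 1, so two overlapping tiles always sum
    to unit weight (a partition of unity)."""
    t = torch.linspace(0, 1, overlap)
    return 0.5 - 0.5 * torch.cos(torch.pi * t)
```

Linear feathering has the same partition-of-unity property; unnormalized Gaussian weights generally do not, which is part of why I'm unsure a Gaussian blend would actually help.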
## Specific Questions for the Community
1. Has anyone achieved official-demo-quality results on an 8GB GPU? If so, what settings/compromises were used?
2. Is there a known minimum VRAM threshold below which SeedVR2 3B cannot produce good results regardless of tiling/offloading strategies?
3. Are there any other pipeline differences between the official ByteDance inference code and typical 8GB GPU implementations that I might have missed?
4. Would SeedVR2 7B with more aggressive quantization (e.g., Q4) produce better results than 3B at Q8 on 8GB VRAM?
## Reproduction
Single-frame test at 720p output (SD input → 720p upscale):
- Input: 480×360 DV frame
- Output target: 960×720 (padded to divisible-by-8)
- VAE tiles: 512×512, overlap 160px
- DiT: Full latent, block-by-block offload, FP8 mode
- Diffusion: 1 step, Euler sampler
- Color correction: wavelet only
- Processing time: ~35 seconds per frame
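The settings above, collected as one plain dict for easy copy/paste when reproducing (keys are illustrative, not a real SeedVR2 config schema):

```python
# Single-frame reproduction settings; key names are made up for readability.
REPRO_CONFIG = {
    "input_size": (480, 360),         # DV frame (W, H)
    "output_size": (960, 720),        # padded to divisible-by-8
    "vae_tile": 512,                  # square VAE tile edge, pixels
    "vae_tile_overlap": 160,          # pixels, cosine-blended
    "vae_dtype": "bfloat16",          # per Fix 1
    "dit_offload": "block_by_block",  # FP8-compressed CPU storage
    "steps": 1,                       # distilled 1-step inference
    "sampler": "euler",
    "color_correction": "wavelet",    # per Fix 2
}
```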
## References
- Official repo: ByteDance-Seed/SeedVR
- ComfyUI node: numz/ComfyUI-SeedVR2_VideoUpscaler
- Related issues: #159 (TRT performance), #394 (DiT tiling)
Any guidance or insights from folks who have worked on low-VRAM SeedVR2 implementations would be greatly appreciated.