
GTX 1650: disambiguate GDDR5/GDDR6 variants by memory clock (measured 192 GB/s) #115

Open
cms-pm wants to merge 1 commit into Andyyyy64:main from cms-pm:gtx1650-gddr6-variant

cms-pm commented Jun 15, 2026


Summary

GPU_BANDWIDTH["GTX 1650"] = 128.0 is correct for the original GDDR5 card,
but the GTX 1650 also shipped in a later GDDR6 revision at 192 GB/s
(12 Gbps × 128-bit). Both use the same TU117 die and PCI device id 0x1F82, and
the driver reports the same name, so the single curated value under-states the
GDDR6 variant's bandwidth by a third (192 GB/s is 1.5× the curated 128). As a
result, whichllm under-predicts tok/s and under-recommends models for a card
that is still widely sold.
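
Both bandwidth figures follow from per-pin data rate times bus width. A minimal sketch of that arithmetic, using the 8 Gbps and 12 Gbps rates quoted in this PR (the helper name `peak_bandwidth_gbs` is illustrative and not part of whichllm):

```python
def peak_bandwidth_gbs(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Theoretical peak memory bandwidth in GB/s: Gbit/s per pin x pins / 8 bits per byte."""
    return data_rate_gbps * bus_width_bits / 8


gddr5 = peak_bandwidth_gbs(8, 128)   # original GTX 1650
gddr6 = peak_bandwidth_gbs(12, 128)  # later GDDR6 revision

print(f"GDDR5: {gddr5} GB/s, GDDR6: {gddr6} GB/s, ratio: {gddr6 / gddr5}")
```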

This PR disambiguates the two by max memory clock, the only reliable
discriminator, and leaves 128 as the conservative default when the clock is
unknown.

Change

  • data/gpu.py: GPU_MEMORY_CLOCK_VARIANTS maps GTX 1650 → 192 GB/s when max
    memory clock ≥ 5500 MHz (GDDR6 ≈ 6001), else 128 (GDDR5 ≈ 4001). The 5500 split
    sits well clear of both regimes.
  • gpu_db.py: resolve_detected_bandwidth(name, vram_bytes, mem_clock_mhz=None)
    consults the variant table first, then the existing curated/dbgpu path.
    Behaviour is byte-identical when mem_clock_mhz is unknown.
  • nvidia.py: capture max memory clock via NVML (NVML_CLOCK_MEM) and the
    nvidia-smi fallback (clocks.max.memory). The smi query retries without the
    clock field if a driver rejects it, so an optional field can never reduce
    detection to zero GPUs.
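
The lookup order described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the tuple layout of the variant table and the function name `resolve_bandwidth_sketch` are hypothetical, and the real resolve_detected_bandwidth also takes vram_bytes and falls back to dbgpu.

```python
from typing import Optional

# Assumed layout: name -> (threshold MHz, GB/s at or above threshold, GB/s below).
GPU_MEMORY_CLOCK_VARIANTS = {"GTX 1650": (5500, 192.0, 128.0)}

# Stand-in for the existing curated table.
GPU_BANDWIDTH = {"GTX 1650": 128.0}


def resolve_bandwidth_sketch(
    short_name: str, mem_clock_mhz: Optional[int] = None
) -> Optional[float]:
    """Consult the clock-variant table first, then the curated value."""
    variant = GPU_MEMORY_CLOCK_VARIANTS.get(short_name)
    if variant is not None and mem_clock_mhz is not None:
        threshold, at_or_above, below = variant
        return at_or_above if mem_clock_mhz >= threshold else below
    # Unknown clock or no variant entry: behave exactly as before.
    return GPU_BANDWIDTH.get(short_name)


print(resolve_bandwidth_sketch("GTX 1650", 6001))  # GDDR6 board
print(resolve_bandwidth_sketch("GTX 1650", 4001))  # GDDR5 board
print(resolve_bandwidth_sketch("GTX 1650"))        # clock unknown
```

Keeping the threshold in data rather than code is what would let other ambiguous cards be added later without touching the resolver.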

Validation (measured, not inferred)

Clock-locked, sole-tenant, llama.cpp build 2cbfdc6, on a confirmed GDDR6
board (VBIOS 90.17.4D.00.1E, nvidia-smi clocks.max.memory = 6001 MHz):

| model (Q4_K_M) | measured tok/s | est @128 | est @192 |
|---|---|---|---|
| Qwen3-1.7B | 75.4 ± 0.12 | 52 (m/p 1.45) | 78 (m/p 0.97) |
| Llama-3.2-3B | 49.6 ± 0.03 | 35 (m/p 1.42) | 52 (m/p 0.94) |

At 128, measured throughput is 42 to 45% above whichllm's estimate (m/p is
measured/predicted); at 192 the estimate lands within ~6% of the measurement.
The patched detector returns ('NVIDIA GeForce GTX 1650', 192.0) on the card.
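
The m/p column can be rechecked from the rounded figures in the table. A small sketch follows; because the estimates shown are rounded to whole tok/s, recomputed ratios may differ from the table in the last digit.

```python
# (model, measured tok/s, estimate at 128 GB/s, estimate at 192 GB/s), from the table above.
ROWS = [
    ("Qwen3-1.7B", 75.4, 52, 78),
    ("Llama-3.2-3B", 49.6, 35, 52),
]


def measured_over_predicted(measured: float, predicted: float) -> float:
    """m/p > 1 means the estimate is too low; m/p < 1 means it is too high."""
    return measured / predicted


ratios = {
    model: (
        measured_over_predicted(measured, est_128),
        measured_over_predicted(measured, est_192),
    )
    for model, measured, est_128, est_192 in ROWS
}

for model, (mp_128, mp_192) in ratios.items():
    print(f"{model}: m/p @128 = {mp_128:.2f}, m/p @192 = {mp_192:.2f}")
```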

Data & reproduction. Methodology and the full four-model calibration are in
the accompanying paper (SSRN 6941538, The kv4 Trade-off Is Workload-Dependent);
the measured evidence + clock-locked harness are public and reproducible at
https://github.com/cms-pm/kv4-edge-inference (analysis/analyze.py reprints the
tok/s used here from the shipped data). This mirrors how PR #75 cited a paper for
a legacy GPU — but with measured, independently re-runnable numbers.

Tests

tests/test_gtx1650_variants.py — 15 tests: resolver disambiguation incl.
threshold boundary and back-compat (unknown clock → 128); nvidia-smi parse incl.
[N/A] clock and the 3-field-fails→2-field-retry regression guard; and a
measured-calibration check (192 estimate scales 1.5× over 128 and lands near the
measured 75.4). Full suite green.
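
The nvidia-smi behaviour these tests guard can be sketched as below. This is a hedged illustration rather than the code in nvidia.py: the function names, the injectable `run` parameter, and the assumption that GPU names contain no commas are all mine. The query fields (name, memory.total, clocks.max.memory) and the csv,noheader,nounits format are standard nvidia-smi options.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

FIELDS_WITH_CLOCK = "name,memory.total,clocks.max.memory"
FIELDS_BASE = "name,memory.total"

GpuRow = Tuple[str, int, Optional[int]]  # (name, VRAM in MiB, max mem clock in MHz)


def run_smi(fields: str) -> str:
    """Run nvidia-smi for the given query fields and return its stdout."""
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True, timeout=10,
    ).stdout


def parse_row(line: str) -> GpuRow:
    """Parse one CSV row; a missing or '[N/A]' clock becomes None."""
    parts = [p.strip() for p in line.split(",")]  # assumes no commas in the GPU name
    clock = int(parts[2]) if len(parts) > 2 and parts[2].isdigit() else None
    return parts[0], int(parts[1]), clock


def query_gpus(run: Callable[[str], str] = run_smi) -> List[GpuRow]:
    """Try the 3-field query; if the driver rejects it, retry without the clock."""
    try:
        out = run(FIELDS_WITH_CLOCK)
    except subprocess.CalledProcessError:
        out = run(FIELDS_BASE)
    return [parse_row(line) for line in out.splitlines() if line.strip()]
```

Passing the runner as a parameter lets the retry path be exercised without a GPU, for example with a fake that raises CalledProcessError whenever the clock field is requested.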

Scope and honest limitations

  • Cross-platform via NVML/nvidia-smi; only the bare-WMI fallback lacks a clock.
    detect_hardware() calls detect_nvidia_gpus() on every OS, and that path uses
    pynvml (nvml.dll) with an nvidia-smi fallback — both of which ship with the
    NVIDIA driver on Windows. So a GDDR6 GTX 1650 on a normal Windows box is
    disambiguated by the same memory-clock code as Linux. The 128 default is only
    reached in the narrow case where NVML and nvidia-smi both fail and detection
    falls through to the pure-WMI path (hardware/windows.py), which has no memory
    clock (Win32_VideoController does not expose VRAM clock). Teaching that
    fallback to shell nvidia-smi is a small, hardware-free follow-up; flagged here
    so the one remaining gap is explicit, not accidental.
  • Generality. The mechanism is intentionally generic: the same name+clock
    ambiguity affects other cards, most notably the GT 1030 (DDR4 ~16 GB/s vs
    GDDR5 ~48 GB/s), a 3× error. This PR populates only the GTX 1650 (the variant
    measured here); GT 1030 and similar can be added to the same table as data
    without further code.
  • The GDDR5 clock (~4001 MHz) is from NVIDIA's 8 Gbps spec, not independently
    measured here; only the GDDR6 board was measured. The 5500 threshold is safe
    given the 8-vs-12 Gbps gap regardless.

Backward compatibility

Additive only. All pre-existing tests pass; the new mem_clock_mhz parameters
default to None, preserving exact prior behaviour for every code path that does
not supply a clock.

The GTX 1650 ships in two memory configurations that the driver name and PCI
device id (0x1F82) cannot tell apart: original GDDR5 (8 Gbps x 128-bit = 128 GB/s)
and a later GDDR6 revision (12 Gbps x 128-bit = 192 GB/s). whichllm's single
"GTX 1650": 128.0 is right for GDDR5 but under-states GDDR6 by a third (192 is
1.5x 128), which then under-predicts tok/s and under-recommends models.

Resolve by max memory clock at detection time (the only reliable discriminator):
- data/gpu.py: GPU_MEMORY_CLOCK_VARIANTS maps GTX 1650 -> 192 when max mem clock
  >= 5500 MHz (GDDR6 ~6001) else 128 (GDDR5 ~4001). 128 stays the curated default.
- gpu_db.py: resolve_detected_bandwidth(..., mem_clock_mhz) tries the variant
  first, then the existing curated/dbgpu path. Identical behaviour when unknown.
- nvidia.py: capture max mem clock via NVML (NVML_CLOCK_MEM) and nvidia-smi
  (clocks.max.memory). The smi query retries without the clock field on error so
  a missing optional field can never wipe out detection.

Validation (clock-locked, sole-tenant, build 2cbfdc6 on a GDDR6 board, VBIOS
90.17.4D.00.1E): nvidia-smi clocks.max.memory = 6001 MHz; Qwen3-1.7B Q4_K_M
decodes 75.4 tok/s, matching the 192 estimate (~78) and not 128's (~52). The
patched detector returns ('NVIDIA GeForce GTX 1650', 192.0) on the card.

Back-compatible: all existing tests pass; mem-clock params default to None.