GTX 1650: disambiguate GDDR5/GDDR6 variants by memory clock (measured 192 GB/s) #115
Open
cms-pm wants to merge 1 commit into
The GTX 1650 ships in two memory configurations that the driver name and PCI device
id (0x1F82) cannot tell apart: the original GDDR5 (8 Gbps x 128-bit = 128 GB/s) and a
later GDDR6 revision (12 Gbps x 128-bit = 192 GB/s). whichllm's single
"GTX 1650": 128.0 is right for GDDR5 but too low for GDDR6, whose 192 GB/s is 50%
higher; that gap under-predicts tok/s and under-recommends models.
Resolve by max memory clock at detection time (the only reliable discriminator):
- data/gpu.py: GPU_MEMORY_CLOCK_VARIANTS maps GTX 1650 -> 192 when max mem clock
>= 5500 MHz (GDDR6 ~6001) else 128 (GDDR5 ~4001). 128 stays the curated default.
- gpu_db.py: resolve_detected_bandwidth(..., mem_clock_mhz) tries the variant
first, then the existing curated/dbgpu path. Identical behaviour when unknown.
- nvidia.py: capture max mem clock via NVML (NVML_CLOCK_MEM) and nvidia-smi
(clocks.max.memory). The smi query retries without the clock field on error so
a missing optional field can never wipe out detection.
Validation (clock-locked, sole-tenant, build 2cbfdc6 on a GDDR6 board, VBIOS
90.17.4D.00.1E): nvidia-smi clocks.max.memory = 6001 MHz; Qwen3-1.7B Q4_K_M
decodes 75.4 tok/s, matching the 192 estimate (~78) and not 128's (~52). The
patched detector returns ('NVIDIA GeForce GTX 1650', 192.0) on the card.
Back-compatible: all existing tests pass; mem-clock params default to None.
Summary
GPU_BANDWIDTH["GTX 1650"] = 128.0 is correct for the original GDDR5 card,
but the GTX 1650 also shipped in a later GDDR6 revision at 192 GB/s
(12 Gbps × 128-bit). Both use the same TU117 die and PCI device id
0x1F82, and the driver reports the same name, so the single curated value is
too low for the GDDR6 variant (192 GB/s is 50% higher than 128 GB/s). That
under-predicts tok/s and under-recommends models for a card that is still
widely sold.
This PR disambiguates the two by max memory clock, the only reliable
discriminator, and leaves 128 as the conservative default when the clock is
unknown.
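The two bandwidth figures follow directly from the per-pin data rate and the bus width quoted above. As a minimal worked check (the helper name `peak_bandwidth_gb_s` is illustrative and not part of whichllm):

```python
def peak_bandwidth_gb_s(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s.

    Per-pin data rate (Gbit/s) times bus width (bits), divided by 8 bits per byte.
    """
    return data_rate_gbps * bus_width_bits / 8


gddr5 = peak_bandwidth_gb_s(8, 128)    # original GDDR5 board
gddr6 = peak_bandwidth_gb_s(12, 128)   # later GDDR6 revision

print(gddr5, gddr6, gddr6 / gddr5)     # 128.0 192.0 1.5
```

The 1.5× ratio is the "50% higher" gap that a single curated value cannot represent.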
Change
- data/gpu.py: GPU_MEMORY_CLOCK_VARIANTS maps GTX 1650 → 192 GB/s when max
  memory clock ≥ 5500 MHz (GDDR6 ≈ 6001), else 128 (GDDR5 ≈ 4001). The 5500 split
  sits well clear of both regimes.
- gpu_db.py: resolve_detected_bandwidth(name, vram_bytes, mem_clock_mhz=None)
  consults the variant table first, then the existing curated/dbgpu path.
  Behaviour is byte-identical when mem_clock_mhz is unknown.
- nvidia.py: capture max memory clock via NVML (NVML_CLOCK_MEM) and the
  nvidia-smi fallback (clocks.max.memory). The smi query retries without the
  clock field if a driver rejects it, so an optional field can never reduce
  detection to zero GPUs.
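The resolution order described above can be sketched roughly as follows. Only the names GPU_MEMORY_CLOCK_VARIANTS, GPU_BANDWIDTH and the resolve_detected_bandwidth signature come from this PR; the table layout, the substring name matching, and the omission of the dbgpu path are assumptions made for illustration, so the real code may differ.

```python
from typing import Optional

# Assumed layout: name fragment -> (threshold MHz, GB/s at or above, GB/s below).
GPU_MEMORY_CLOCK_VARIANTS = {
    "GTX 1650": (5500, 192.0, 128.0),
}

# Stand-in for the curated table; only the entry discussed in this PR.
GPU_BANDWIDTH = {"GTX 1650": 128.0}


def resolve_detected_bandwidth(
    name: str, vram_bytes: int, mem_clock_mhz: Optional[int] = None
) -> Optional[float]:
    """Try the clock-variant table first, then fall back to the curated value.

    vram_bytes is accepted to mirror the PR's signature but is unused here.
    Substring matching is a simplification: it would also match names such
    as "GTX 1650 SUPER", which real code would need to exclude.
    """
    if mem_clock_mhz is not None:
        for key, (threshold, high, low) in GPU_MEMORY_CLOCK_VARIANTS.items():
            if key in name:
                return high if mem_clock_mhz >= threshold else low
    # Unknown clock (or no variant entry): same result as before the change.
    for key, bandwidth in GPU_BANDWIDTH.items():
        if key in name:
            return bandwidth
    return None


four_gib = 4 * 1024**3
print(resolve_detected_bandwidth("NVIDIA GeForce GTX 1650", four_gib, 6001))  # 192.0
print(resolve_detected_bandwidth("NVIDIA GeForce GTX 1650", four_gib, 4001))  # 128.0
print(resolve_detected_bandwidth("NVIDIA GeForce GTX 1650", four_gib))        # 128.0
```

Keeping the variants as data means another ambiguous card only needs a new table row, provided a clock threshold separates its memory types cleanly.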
Validation (measured, not inferred)
Clock-locked, sole-tenant, llama.cpp build 2cbfdc6, on a confirmed GDDR6
board (VBIOS 90.17.4D.00.1E, nvidia-smi clocks.max.memory = 6001 MHz):
Qwen3-1.7B Q4_K_M decodes 75.4 tok/s, against estimates of ~78 at 192 GB/s and
~52 at 128 GB/s.
At 128, whichllm under-predicts by ~45%; at 192 it lands within ~6%. The patched
detector returns ('NVIDIA GeForce GTX 1650', 192.0) on the card.
Data & reproduction. Methodology and the full four-model calibration are in
the accompanying paper (SSRN 6941538, The kv4 Trade-off Is Workload-Dependent);
the measured evidence and clock-locked harness are public and reproducible at
https://github.com/cms-pm/kv4-edge-inference (analysis/analyze.py reprints the
tok/s used here from the shipped data). This mirrors how PR #75 cited a paper for
a legacy GPU, but with measured, independently re-runnable numbers.
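As a rough cross-check of the percentages, the arithmetic below uses the rounded figures from the commit message (75.4 tok/s measured, ~52 and ~78 estimated). It assumes the estimate scales linearly with bandwidth, as the 1.5× calibration test implies. The helper name `gap_vs_prediction` is illustrative only.

```python
measured_tok_s = 75.4   # Qwen3-1.7B Q4_K_M on the GDDR6 board
est_at_128 = 52.0       # rounded estimate with the 128 GB/s entry
est_at_192 = 78.0       # rounded estimate with the 192 GB/s entry


def gap_vs_prediction(predicted: float, measured: float) -> float:
    """Percent by which the measurement exceeds (+) or trails (-) the prediction."""
    return round((measured - predicted) / predicted * 100, 1)


# Linear scaling with bandwidth reproduces the 192 GB/s estimate from the 128 one.
print(est_at_128 * 192 / 128)                         # 78.0

print(gap_vs_prediction(est_at_128, measured_tok_s))  # 45.0
print(gap_vs_prediction(est_at_192, measured_tok_s))  # -3.3
```

With these rounded inputs the 192 GB/s estimate overshoots by about 3%, consistent with the stated "within ~6%"; the exact figures depend on the full calibration data in the linked repository.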
Tests
tests/test_gtx1650_variants.py: 15 tests covering resolver disambiguation, including the
threshold boundary and back-compat (unknown clock → 128); nvidia-smi parsing, including an
[N/A] clock and the 3-field-fails→2-field-retry regression guard; and a
measured-calibration check (the 192 estimate scales 1.5× over 128 and lands near the
measured 75.4). Full suite green.
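The parse and retry behaviour those tests guard can be sketched as below. The nvidia-smi flags and field names are real; the function names, the injectable `run` parameter, and the assumption that a rejected field surfaces as a non-zero exit are illustrative rather than taken from nvidia.py.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

FIELDS_WITH_CLOCK = "name,memory.total,clocks.max.memory"
FIELDS_BASE = "name,memory.total"

GpuRow = Tuple[str, int, Optional[int]]  # (name, VRAM in MiB, max mem clock in MHz)


def run_nvidia_smi(fields: str) -> str:
    """Run nvidia-smi for the given query fields and return its CSV output."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_rows(output: str) -> List[GpuRow]:
    """Parse CSV rows; a missing or non-numeric clock (e.g. "[N/A]") becomes None."""
    rows: List[GpuRow] = []
    for line in output.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 2:
            continue
        clock = int(parts[2]) if len(parts) >= 3 and parts[2].isdigit() else None
        rows.append((parts[0], int(parts[1]), clock))
    return rows


def query_gpus(run: Callable[[str], str] = run_nvidia_smi) -> List[GpuRow]:
    """Query with the clock field first; on any failure retry without it."""
    for fields in (FIELDS_WITH_CLOCK, FIELDS_BASE):
        try:
            return parse_rows(run(fields))
        except (subprocess.CalledProcessError, OSError, ValueError):
            continue  # optional field rejected or tool unavailable: try the next query
    return []
```

Passing a fake `run` callable lets the three-field failure be simulated without a GPU, which is the kind of hardware-free regression guard the test list describes.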
Scope and honest limitations
- detect_hardware() calls detect_nvidia_gpus() on every OS, and that path uses
  pynvml (nvml.dll) with an nvidia-smi fallback, both of which ship with the
  NVIDIA driver on Windows. So a GDDR6 GTX 1650 on a normal Windows box is
  disambiguated by the same memory-clock code as Linux. The 128 default is only
  reached in the narrow case where NVML and nvidia-smi both fail and detection
  falls through to the pure-WMI path (hardware/windows.py), which has no memory
  clock (Win32_VideoController does not expose VRAM clock). Teaching that
  fallback to shell nvidia-smi is a small, hardware-free follow-up; flagged here
  so the one remaining gap is explicit, not accidental.
- The same ambiguity affects other cards, most notably the GT 1030 (DDR4 ~16 GB/s vs
  GDDR5 ~48 GB/s), a 3× error. This PR populates only the GTX 1650 (the variant
  measured here); GT 1030 and similar can be added to the same table as data
  without further code.
- The GDDR5 board was not measured here; only the GDDR6 board was measured. The
  5500 threshold is safe given the 8-vs-12 Gbps gap regardless.
Backward compatibility
Additive only. All pre-existing tests pass; the new mem_clock_mhz parameters
default to None, preserving exact prior behaviour for every code path that does
not supply a clock.