Problem
Adding a new GPU type currently requires updating 3-4 separate files with hardcoded mappings. GPU-related constants are scattered across the codebase:
| Dict/Constant | File | Purpose |
|---|---|---|
| GPU_PRODUCT_MAP (17 entries) | src/planner/cluster/gpu_detector.py | K8s node labels → canonical names |
| CATALOG_TO_ROOFLINE_GPU (8 entries) | src/planner/recommendation/estimator.py | Canonical names → llm_optimizer names |
| _KNOWN_GPU_TOKENS (11 entries) | src/planner/shared/utils/gpu_normalizer.py | Pattern-matching fallback list |
| Hardcoded GPU tier check | src/planner/configuration/generator.py:138-149 | CPU/memory resource allocation by GPU class |
Meanwhile, data/configuration/gpu_catalog.json already serves as the canonical source of truth with aliases, memory, cost, and node selectors per GPU — but the above mappings duplicate or extend that information independently.
Proposal
Make gpu_catalog.json the single source of truth by adding fields and deriving the scattered mappings from it:
1. Add K8s label values to catalog aliases
GPU_PRODUCT_MAP entries (e.g., "nvidia-h100-sxm5-80gb" → "H100") are effectively aliases. Add them to each GPU's aliases array in gpu_catalog.json, then have gpu_detector.py build its lookup from the catalog instead of a hardcoded dict.
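As a minimal sketch, gpu_detector.py could build its lookup from the catalog at load time. The catalog excerpt and the build_product_map helper name here are illustrative assumptions, not the project's actual code:

```python
# Sketch: build the node-label -> canonical-name lookup from the catalog
# instead of a hardcoded GPU_PRODUCT_MAP. The entry shape ("aliases" list
# per canonical GPU name) is assumed from the catalog fields described above.
def build_product_map(catalog: dict) -> dict:
    product_map = {}
    for canonical, entry in catalog.items():
        for alias in entry.get("aliases", []):
            # Lowercase so lookups are case-insensitive against node labels.
            product_map[alias.lower()] = canonical
    return product_map

# Hypothetical catalog excerpt:
catalog = {
    "H100": {"aliases": ["nvidia-h100-sxm5-80gb", "nvidia-h100-pcie-80gb"]},
    "A100-80": {"aliases": ["nvidia-a100-sxm4-80gb"]},
}
```

With this shape, a label such as "nvidia-h100-sxm5-80gb" resolves to "H100" purely from catalog data.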
2. Add roofline model name to catalog
Add a "roofline_name" field (e.g., "roofline_name": "A100-40GB" for A100-40) to each GPU entry. estimator.py would read this from the catalog instead of maintaining CATALOG_TO_ROOFLINE_GPU. GPUs without roofline support would have "roofline_name": null.
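A sketch of what the lookup in estimator.py might reduce to (the catalog entries and the roofline_name helper are hypothetical):

```python
# Sketch: read the roofline model name from the catalog instead of
# maintaining a separate CATALOG_TO_ROOFLINE_GPU dict.
catalog = {
    "A100-40": {"roofline_name": "A100-40GB"},
    "T4": {"roofline_name": None},  # JSON null: no roofline support
}

def roofline_name(catalog: dict, canonical: str):
    """Return the llm_optimizer model name, or None if unsupported/unknown."""
    return catalog.get(canonical, {}).get("roofline_name")
```

Unknown GPUs and GPUs with "roofline_name": null both fall through to None, so the caller has a single "no roofline model" path.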
3. Derive _KNOWN_GPU_TOKENS from catalog
The fuzzy matching token list is just the set of canonical GPU names. It can be built from the catalog's keys at module load time instead of being hardcoded.
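The derivation is a one-liner; a sketch (catalog contents are illustrative):

```python
# Sketch: derive the fuzzy-matching token set from the catalog's keys at
# module load time instead of hardcoding _KNOWN_GPU_TOKENS.
def known_gpu_tokens(catalog: dict) -> frozenset:
    return frozenset(catalog)

# Hypothetical catalog with four canonical names:
catalog = {"H100": {}, "H200": {}, "A100-40": {}, "A100-80": {}}
KNOWN_GPU_TOKENS = known_gpu_tokens(catalog)
```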
4. Add resource tier to catalog
Replace the string-matching logic in generator.py (if "H100" in gpu_type or "H200" in ...) with a "resource_tier" field in the catalog (e.g., "high" vs "standard"), with corresponding CPU/memory presets.
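A sketch of the tier-driven allocation (tier names, preset CPU/memory values, and the resources_for helper are illustrative assumptions, not the project's real numbers):

```python
# Sketch: replace substring checks like `if "H100" in gpu_type` with a
# catalog-driven "resource_tier" field mapped to CPU/memory presets.
TIER_PRESETS = {
    "high": {"cpu": "16", "memory": "128Gi"},
    "standard": {"cpu": "8", "memory": "64Gi"},
}

def resources_for(catalog: dict, canonical: str) -> dict:
    # Unknown GPUs fall back to the "standard" tier.
    tier = catalog.get(canonical, {}).get("resource_tier", "standard")
    return TIER_PRESETS[tier]

# Hypothetical catalog entries:
catalog = {
    "H100": {"resource_tier": "high"},
    "L40S": {"resource_tier": "standard"},
}
```

Adding a new high-end GPU then means tagging it "resource_tier": "high" in the catalog rather than extending an if-chain in generator.py.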
What's fine as-is
- GPU_EXPANSIONS (A100 → both memory variants) is normalization logic, not a mapping that tracks GPU additions
- gpu_normalizer.py module structure is good — already uses the catalog as its primary lookup
- CostManager in gpu_recommender.py supports custom cost overrides for the CLI; it could use the catalog as a base, but that is a separate concern
Benefit
Adding a new GPU becomes a single edit to gpu_catalog.json instead of a multi-file scavenger hunt. Reduces risk of mappings getting out of sync.