
refactor: consolidate GPU mappings to use gpu_catalog.json as single source of truth #162

@anfredette

Description

Problem

Adding a new GPU type currently requires updating 3-4 separate files with hardcoded mappings. GPU-related constants are scattered across the codebase:

| Dict/Constant | File | Purpose |
| --- | --- | --- |
| `GPU_PRODUCT_MAP` (17 entries) | `src/planner/cluster/gpu_detector.py` | K8s node labels → canonical names |
| `CATALOG_TO_ROOFLINE_GPU` (8 entries) | `src/planner/recommendation/estimator.py` | Canonical names → llm_optimizer names |
| `_KNOWN_GPU_TOKENS` (11 entries) | `src/planner/shared/utils/gpu_normalizer.py` | Pattern-matching fallback list |
| Hardcoded GPU tier check | `src/planner/configuration/generator.py:138-149` | CPU/memory resource allocation by GPU class |

Meanwhile, data/configuration/gpu_catalog.json already serves as the canonical source of truth with aliases, memory, cost, and node selectors per GPU — but the above mappings duplicate or extend that information independently.

Proposal

Make gpu_catalog.json the single source of truth by adding fields and deriving the scattered mappings from it:

1. Add K8s label values to catalog aliases

GPU_PRODUCT_MAP entries (e.g., "nvidia-h100-sxm5-80gb" → "H100") are effectively aliases. Add them to each GPU's aliases array in gpu_catalog.json, then have gpu_detector.py build its lookup from the catalog instead of a hardcoded dict.
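A minimal sketch of the derived lookup, assuming an illustrative catalog shape (the dict below stands in for the parsed `gpu_catalog.json`; field names beyond `aliases` are omitted and the entries are examples, not the actual catalog contents):

```python
# Illustrative stand-in for the parsed gpu_catalog.json; entries are examples.
CATALOG = {
    "H100": {"aliases": ["h100", "nvidia-h100-sxm5-80gb", "nvidia-h100-pcie-80gb"]},
    "A100-40": {"aliases": ["a100", "nvidia-a100-sxm4-40gb"]},
}

def build_product_map(catalog: dict) -> dict:
    """Derive the K8s-label -> canonical-name lookup from catalog aliases."""
    product_map = {}
    for canonical, entry in catalog.items():
        for alias in entry.get("aliases", []):
            product_map[alias.lower()] = canonical
    return product_map

PRODUCT_MAP = build_product_map(CATALOG)
```

With the label values folded into `aliases`, `gpu_detector.py` could build `PRODUCT_MAP` once at import time and the hardcoded dict goes away.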

2. Add roofline model name to catalog

Add a "roofline_name" field (e.g., "roofline_name": "A100-40GB" for A100-40) to each GPU entry. estimator.py would read this from the catalog instead of maintaining CATALOG_TO_ROOFLINE_GPU. GPUs without roofline support would have "roofline_name": null.
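A sketch of the catalog-backed lookup that would replace `CATALOG_TO_ROOFLINE_GPU` (catalog entries below are illustrative; the helper name `roofline_name` is hypothetical):

```python
# Illustrative entries; "roofline_name" is the proposed new field.
CATALOG = {
    "A100-40": {"roofline_name": "A100-40GB"},
    "L4": {"roofline_name": None},  # no roofline support for this GPU
}

def roofline_name(catalog: dict, canonical: str):
    """Return the llm_optimizer roofline name for a canonical GPU, or None."""
    entry = catalog.get(canonical)
    return entry.get("roofline_name") if entry else None
```

`estimator.py` can then treat a `None` result the same way it treats a missing `CATALOG_TO_ROOFLINE_GPU` entry today.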

3. Derive _KNOWN_GPU_TOKENS from catalog

The fuzzy matching token list is just the set of canonical GPU names. It can be built from the catalog's keys at module load time instead of being hardcoded.
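Deriving the token set is a one-liner over the catalog keys (catalog contents below are illustrative):

```python
# Illustrative stand-in for the parsed catalog; only the keys matter here.
CATALOG = {"H100": {}, "A100-40": {}, "L40S": {}}

# Build the fuzzy-match token set from catalog keys at module load time,
# replacing the hardcoded _KNOWN_GPU_TOKENS list.
KNOWN_GPU_TOKENS = frozenset(name.upper() for name in CATALOG)
```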

4. Add resource tier to catalog

Replace the string-matching logic in generator.py (`if "H100" in gpu_type or "H200" in ...`) with a `"resource_tier"` field in the catalog (e.g., `"high"` vs `"standard"`), with corresponding CPU/memory presets.
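A sketch of the tier-based allocation, assuming hypothetical preset names and values (the CPU/memory numbers below are placeholders, not the actual presets in `generator.py`):

```python
# Hypothetical presets; tier names and resource values are illustrative.
TIER_PRESETS = {
    "high": {"cpu": "16", "memory": "64Gi"},
    "standard": {"cpu": "8", "memory": "32Gi"},
}

# Illustrative catalog entries carrying the proposed "resource_tier" field.
CATALOG = {
    "H100": {"resource_tier": "high"},
    "L4": {"resource_tier": "standard"},
}

def resources_for(catalog: dict, gpu_name: str) -> dict:
    """Look up CPU/memory presets by catalog tier, defaulting to standard."""
    tier = catalog.get(gpu_name, {}).get("resource_tier", "standard")
    return TIER_PRESETS[tier]
```

A data-driven tier also means a new high-end GPU gets the right allocation without touching generator.py at all.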

What's fine as-is

  • GPU_EXPANSIONS (A100 → both memory variants) is normalization logic, not a mapping that tracks GPU additions
  • gpu_normalizer.py module structure is good — already uses catalog as primary lookup
  • CostManager in gpu_recommender.py supports custom cost overrides for the CLI; could use catalog as base but is a separate concern

Benefit

Adding a new GPU becomes a single edit to gpu_catalog.json instead of a multi-file scavenger hunt. Reduces risk of mappings getting out of sync.
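For concreteness, a consolidated entry might look like the sketch below. `roofline_name` and `resource_tier` are the proposed new fields; the existing fields (memory, cost, node selectors) are elided, and all values shown are illustrative:

```json
{
  "H100": {
    "aliases": ["h100", "nvidia-h100-sxm5-80gb", "nvidia-h100-pcie-80gb"],
    "roofline_name": "H100",
    "resource_tier": "high"
  }
}
```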
