From 304198bd5afde9ff8a51985ca64ffb5f0d7a3665 Mon Sep 17 00:00:00 2001 From: Andre Fredette Date: Wed, 18 Feb 2026 12:29:19 -0500 Subject: [PATCH 1/3] docs: add comparative evaluation of config_explorer vs NeuralNav Structured technical evaluation covering functional comparison, architectural analysis, overlap/complementarity assessment, and integration feasibility between llm-d-benchmark's config_explorer and NeuralNav. Proposes offline synthetic benchmark generation as the recommended integration path. Assisted-by: Claude Signed-off-by: Andre Fredette --- docs/CONFIG_EXPLORER_EVALUATION.md | 376 +++++++++++++++++++++++++++++ 1 file changed, 376 insertions(+) create mode 100644 docs/CONFIG_EXPLORER_EVALUATION.md diff --git a/docs/CONFIG_EXPLORER_EVALUATION.md b/docs/CONFIG_EXPLORER_EVALUATION.md new file mode 100644 index 0000000..f792f35 --- /dev/null +++ b/docs/CONFIG_EXPLORER_EVALUATION.md @@ -0,0 +1,376 @@ +# Comparative Technical Evaluation: config_explorer vs NeuralNav + +## Executive Summary + +**config_explorer** (llm-d-benchmark) and **NeuralNav** solve related but architecturally distinct problems in the LLM deployment space. config_explorer is a bottom-up infrastructure analysis tool built specifically for the llm-d stack, with prefill/decode (P/D) disaggregation as a first-class concern. NeuralNav is a top-down deployment guidance system that translates business requirements into ranked, SLO-compliant model+GPU configurations using empirical benchmark data -- currently focused on aggregated vLLM deployments, with llm-d support as a stated goal. + +Their overlap is real but narrow -- concentrated in the "given a model and workload, which GPU configuration works?" question. The most promising integration path is **offline synthetic benchmark generation**: using config_explorer to pre-compute estimated performance data in NeuralNav's benchmark format, expanding NeuralNav's recommendation space with clearly labeled confidence levels. 
A full codebase merge would require aligning on shared objectives and operating as a single open-source team. + +## 1. Functional Comparison + +### What config_explorer solves + +config_explorer answers: **"Can this model physically run on this GPU, and what performance can I expect -- especially on llm-d with P/D disaggregation?"** + +Three capabilities: + +1. **Capacity Planning** (`capacity_planner.py`): Calculates GPU memory breakdown -- model weights, KV cache (MHA/GQA/MQA/MLA-aware), activation memory, CUDA overhead -- to determine whether a model fits on a given GPU with a given parallelism strategy (TP/PP/DP/EP). Reports max concurrent requests. Empirically validated against H100 GPUs. Generic to vLLM, not llm-d-specific. +2. **GPU Recommendation** (`recommender/recommender.py`): Uses BentoML's `llm-optimizer` roofline model to synthetically estimate throughput, TTFT, ITL, and E2E latency across GPU types. Ranks by best throughput, lowest latency, or lowest cost. Operates on aggregated vLLM only -- does **not** model P/D disaggregation. +3. **Benchmark Exploration** (`explorer.py`): Loads llm-d benchmark report files (YAML, schema v0.1/v0.2) into Pandas DataFrames with 84 columns. Supports SLO filtering, Pareto-front analysis, and scenario-based visualization. This module is **deeply llm-d-specific** -- its data model has first-class columns for P/D disaggregation (`P_TP`, `D_TP`, `P_Replicas`, `D_Replicas`, `Is_PD`) and llm-d inference scheduler parameters (`KV_Cache_Scorer_Weight`, `Queue_Scorer_Weight`, `Prefix_Cache_Scorer_*`). + +### What NeuralNav solves + +NeuralNav answers: **"Given my business needs, what model+GPU deployment should I use, and how do I get it running?"** + +Four-stage workflow: + +1. **Intent Extraction**: LLM-powered NLU converts natural language ("chatbot for 1000 users") into structured intent (use case, user count, priorities). +2. 
**Specification Generation**: Maps intent to GuideLLM traffic profile + SLO targets (9 use cases, 4 GuideLLM traffic profiles). +3. **Recommendation Engine**: Queries PostgreSQL for all (model, GPU, TP) combos meeting SLO targets from real benchmark data (produced by running GuideLLM against real models on real GPUs). Scores on 4 dimensions (accuracy, price, latency, complexity), generates 5 ranked views. +4. **Configuration & Deployment**: Generates KServe/vLLM YAML via Jinja2 templates, deploys to Kubernetes, monitors health. + +**Important context**: While BLIS simulator benchmarks are currently checked into the repo, NeuralNav's primary focus is on **real benchmarks** produced by running GuideLLM against real models on real GPUs. The reliance on empirical data limits the range of configurations NeuralNav can recommend to only those that have been actually tested. + +### Where responsibilities intersect + +| Concern | config_explorer | NeuralNav | +| --- | --- | --- | +| "Will this model fit on this GPU?" | Yes (memory analysis, pre-benchmark) | Yes, indirectly (benchmark existence proves feasibility, but only for benchmarked combos) | +| "What latency/throughput will I get?" 
| Yes (synthetic roofline) | Yes (empirical benchmark lookup) | +| GPU cost comparison | Yes (hourly x GPU count) | Yes (hourly x GPU count x replicas x 730h) | +| SLO compliance filtering | Yes (constraint strings) | Yes (p95 SQL WHERE clauses) | +| Multi-criteria ranking | Partial (best per dimension, no composite) | Yes (4-dimensional weighted scoring + 5 views) | +| Benchmark data exploration | Yes (Pandas + Pareto + visualization) | No (benchmarks are opaque input) | + +### Where they are clearly different + +| Capability | config_explorer | NeuralNav | +| --- | --- | --- | +| NLU intent extraction | No | Yes (Ollama LLM) | +| Business-context mapping | No | Yes (use case -> traffic -> SLO) | +| Model quality scoring | No | Yes (Artificial Analysis benchmarks) | +| Multi-model comparison | No (single model per run) | Yes (all models meeting SLO, ranked) | +| Replica/scaling calculations | No | Yes (QPS-based planning) | +| YAML generation and K8s deployment | No | Yes (full lifecycle) | +| GPU memory estimation from architecture | Yes (empirically validated) | No | +| Synthetic performance estimation | Yes (BentoML roofline) | No | +| P/D disaggregation modeling | Yes (core to explorer.py data model) | No (aggregated vLLM only; llm-d support is a goal) | +| MoE/MLA/attention-type analysis | Yes | No | +| llm-d scheduler parameter tuning | Yes (KV cache, queue, prefix cache scorer weights) | No | + +## 2. 
Architectural Comparison + +### Core abstractions + +| Aspect | config_explorer | NeuralNav | +| --- | --- | --- | +| Primary abstraction | `KVCacheDetail` dataclass (memory) + `ColumnProperties` dict (84-column benchmark DataFrame) | `DeploymentRecommendation` Pydantic model (scored config) | +| GPU representation | `GPU_SPECS` (BentoML hardware database) + `db.json` (memory specs) + `gpu_costs.json` (pricing) | `gpu_types` in `model_catalog.json` (pricing + memory) | +| Performance data source | BentoML `PerformanceEstimationResult` (synthetic) + benchmark report YAML (empirical) | PostgreSQL `exported_summaries` (empirical from GuideLLM) | +| User input | CLI args / Streamlit widgets (model ID, token lengths, GPU constraints) | Natural language -> `DeploymentIntent` Pydantic model | +| Scoring model | Per-dimension best (no composite score) | 4-dimension weighted composite with scalability penalty | + +### Data models + +**config_explorer**: + +- `KVCacheDetail` dataclass: attention type, precision, layers, heads, per-token memory, total KV cache +- `PerformanceEstimationParams`/`Result` (BentoML): model, input/output len, GPU, num_gpus, framework, constraints -> throughput, TTFT, ITL, E2E +- Benchmark DataFrame: 84 columns covering run metadata, configuration (incl. 
P/D disagg + scheduler params), workload, and metrics (all latency percentiles from p0.1 to p99.9) +- `Scenario` dataclass: Streamlit session state +- `CostManager`: GPU name -> hourly cost + +**NeuralNav**: + +- `DeploymentIntent`: use_case, user_count, priorities, GPU preferences +- `TrafficProfile`: prompt_tokens, output_tokens, expected_qps +- `SLOTargets`: ttft_p95_target_ms, itl_p95_target_ms, e2e_p95_target_ms +- `GPUConfig`: gpu_type, gpu_count, tensor_parallel, replicas +- `ConfigurationScores`: accuracy/price/latency/complexity scores (0-100), balanced_score, slo_status +- `BenchmarkData`: flat row with mean/p90/p95/p99 for TTFT/ITL/E2E/TPS + requests_per_second + estimated flag +- `DeploymentRecommendation`: all above + cost + reasoning + +### Benchmark data formats -- compatibility + +| Field | llm-d benchmark report | config_explorer DataFrame | NeuralNav BenchmarkData | +| --- | --- | --- | --- | +| Model ID | `scenario.model.name` | `Model` column | `model_hf_repo` | +| GPU type | `scenario.host.accelerator.model` | `GPU` column | `hardware` | +| GPU count | `scenario.host.accelerator.count` | `Num_GPUs` column | `hardware_count` | +| Input tokens | `metrics.requests.input_length` | `ISL` column | `prompt_tokens` | +| Output tokens | `metrics.requests.output_length` | `OSL` column | `output_tokens` | +| TTFT | `metrics.latency.ttft` (Statistics with all percentiles) | `Mean_TTFT_ms` through `P99_TTFT_ms` | `ttft_mean`, `ttft_p90`, `ttft_p95`, `ttft_p99` | +| ITL | `metrics.latency.itl` | Same pattern | Same pattern | +| E2E | `metrics.latency.request_latency` | `Mean_E2EL_ms` through `P99_E2EL_ms` | `e2e_mean` through `e2e_p99` | +| Throughput | `metrics.throughput.*_per_sec` | `Request_Throughput`, `Output_Token_Throughput` | `requests_per_second`, `tokens_per_second` | +| P/D disagg | `host.type` (REPLICA/PREFILL/DECODE) | `Is_PD`, `P_TP`, `D_TP`, `P_Replicas`, `D_Replicas` | **Not supported** | +| Scheduler params | Not in v0.1 schema | 
`KV_Cache_Scorer_Weight`, etc. | **Not supported** | + +**Conversion feasibility**: ~80-90% of core performance fields map directly between formats. P/D disaggregation and scheduler parameters have no NeuralNav equivalent yet. Unit normalization (seconds vs milliseconds) required. + +### Extensibility + +- **config_explorer**: Add GPUs via `gpu_costs.json` + BentoML `GPU_SPECS`. Add models by pointing at any HuggingFace model ID. Add benchmark analysis via Pandas operations. llm-d benchmark schema is versioned (v0.1, v0.2). +- **NeuralNav**: Add GPUs via `model_catalog.json`. Add models to catalog + run GuideLLM benchmarks + load to PostgreSQL. Add use cases via `slo_templates.json` + Artificial Analysis weighted score CSVs. + +### API surface + +- **config_explorer**: Python library API + CLI (`config-explorer plan|estimate|start`) + Streamlit UI. No REST API. +- **NeuralNav**: FastAPI REST API (`/api/v1/*`) + Python library API (service classes) + Streamlit UI. + +## 3. Overlap Analysis + +### Shared concepts + +| Concept | config_explorer term | NeuralNav term | +| --- | --- | --- | +| Latency metrics | TTFT, ITL, E2E latency | TTFT p95, ITL p95, E2E p95 | +| Token lengths | input_len, output_len / ISL, OSL | prompt_tokens, output_tokens | +| GPU specification | GPU name + memory + count | gpu_type + gpu_count + tensor_parallel | +| Performance constraints | max_ttft, max_itl, max_latency | SLOTargets | +| Cost calculation | `CostManager.get_cost()` | `ModelCatalog.calculate_gpu_cost()` | +| Tensor parallelism | TP (from `find_possible_tp()`) | tensor_parallel (from benchmark data) | +| Estimated vs real data | (not explicitly tracked) | `estimated` flag in `benchmark_metrics` | + +### Similar workflows + +Both follow: "take a model + workload + GPU -> estimate/lookup performance -> filter by SLO -> rank by preference." 
The differences: + +- **Data source**: synthetic roofline (config_explorer) vs empirical benchmarks (NeuralNav) +- **Scope**: single model per run (config_explorer) vs all qualifying models ranked (NeuralNav) +- **Deployment topology**: P/D disaggregation + scheduler tuning (config_explorer) vs aggregated only (NeuralNav) + +### Redundant capabilities + +1. **GPU cost lookups**: Both maintain separate GPU pricing tables. Could share a source of truth. +2. **SLO filtering**: Both filter by latency constraints -- different mechanisms (BentoML constraint strings vs SQL WHERE), same logic. +3. **"Best by metric" queries**: config_explorer's `get_gpu_with_lowest_cost()` etc. overlap with NeuralNav's ranked views, but NeuralNav operates across models and config_explorer across GPUs. + +### Competing abstractions + +- **Performance data**: `PerformanceEstimationResult` (opaque BentoML object) vs `BenchmarkData` (flat PostgreSQL row). Structurally incompatible. +- **GPU specs**: `GPU_SPECS` (BentoML hardware compute specs) vs `model_catalog.json gpu_types` (pricing-focused). Different schemas, different purposes. +- **Deployment topology**: config_explorer's 84-column DataFrame includes P/D disaggregation as native columns. NeuralNav's `GPUConfig` has no equivalent -- adding it requires schema changes. + +## 4. Complementarity Analysis + +### How config_explorer could enhance NeuralNav + +1. **Fill benchmark gaps with synthetic estimates**: NeuralNav can only recommend (model, GPU) combos that have been empirically benchmarked via GuideLLM. This fundamentally limits the recommendation space. config_explorer's roofline model could produce estimated performance for unbenchmarked combos. These could be presented with **lower confidence** compared to real benchmarks, giving users a broader view of options. This directly addresses the Phase 2 TODO in `src/neuralnav/recommendation/config_finder.py:14-18`. + +2. 
**Offline synthetic benchmark generation**: Rather than calling config_explorer at runtime, use it as an **offline tool** to pre-generate synthetic benchmarks in NeuralNav's `BenchmarkData` format (or `benchmarks_BLIS.json` schema). These would be loaded into PostgreSQL alongside real benchmarks, flagged with `estimated=True`. This avoids runtime dependencies entirely. + +3. **Memory-feasibility pre-filtering**: Before querying benchmarks, NeuralNav could use `capacity_planner` to verify that a model physically fits on a GPU -- useful for error messaging and validating benchmark data integrity. + +4. **Path to llm-d support**: config_explorer's data model already handles P/D disaggregation natively. As NeuralNav adds llm-d support, config_explorer's `explorer.py` column schema and benchmark report loading could inform NeuralNav's data model evolution -- particularly how to represent disaggregated configurations in `GPUConfig` and `BenchmarkData`. + +5. **KV cache and concurrency analysis**: `capacity_planner.max_concurrent_requests()` could improve NeuralNav's replica calculations beyond the current `ceil(required_qps * 1.2 / benchmark_rps)`. + +### How NeuralNav could enhance config_explorer + +1. **Business-context-aware recommendations**: config_explorer recommends GPUs in isolation. NeuralNav's use-case -> SLO -> multi-model ranking pipeline gives business-aligned recommendations. + +2. **Model quality awareness**: config_explorer has no concept of model quality -- it would happily recommend a fast but low-quality model. NeuralNav's Artificial Analysis scoring prevents this. + +3. **End-to-end deployment**: config_explorer stops at recommendation. NeuralNav's YAML generation and K8s deployment turns recommendations into running services. + +4. **Multi-model fleet comparison**: NeuralNav compares all qualifying models in a single request. config_explorer analyzes one model at a time. 
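Returning to the offline-generation path above, the handoff from a roofline estimate to NeuralNav's flat benchmark row can be sketched as a small translation step. This is a sketch only: the field names on the roofline side are assumptions standing in for BentoML's `PerformanceEstimationResult` (whose real attribute names may differ), the NeuralNav keys follow the `BenchmarkData` fields listed in Section 2, and the percentile multipliers are the placeholder heuristic discussed under the integration options.

```python
from dataclasses import dataclass

# Hypothetical flat stand-in for BentoML's PerformanceEstimationResult;
# real attribute names on that object may differ.
@dataclass
class RooflineEstimate:
    model: str
    gpu: str
    num_gpus: int
    input_len: int
    output_len: int
    ttft_ms: float
    itl_ms: float
    e2e_ms: float
    tokens_per_second: float

# Heuristic percentile multipliers (placeholders until real
# distribution models are available).
PCTL = {"p90": 1.05, "p95": 1.10, "p99": 1.20}

def to_benchmark_row(est: RooflineEstimate) -> dict:
    """Translate a synthetic estimate into a NeuralNav-style benchmark row."""
    row = {
        "model_hf_repo": est.model,
        "hardware": est.gpu,
        "hardware_count": est.num_gpus,
        "prompt_tokens": est.input_len,
        "output_tokens": est.output_len,
        "tokens_per_second": est.tokens_per_second,
        "estimated": True,  # mark as synthetic, lower confidence
    }
    # Expand each mean latency into heuristic percentiles.
    for metric, mean in (("ttft", est.ttft_ms), ("itl", est.itl_ms), ("e2e", est.e2e_ms)):
        row[f"{metric}_mean"] = mean
        for pctl, mult in PCTL.items():
            row[f"{metric}_{pctl}"] = round(mean * mult, 2)
    return row
```

Because the output is a plain flat row with the `estimated` flag set, it can be loaded into PostgreSQL alongside real GuideLLM results without touching NeuralNav's query path.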
+ +### Integration leverage points + +- **`capacity_planner` functions** -> NeuralNav's `ConfigFinder._calculate_required_replicas()` for better replica estimation +- **`GPURecommender`** -> Offline batch synthetic benchmark generation script +- **`explorer.py` data model** -> Reference for extending NeuralNav's schemas to support P/D disaggregation +- **llm-d benchmark report schema** -> Standard format for real benchmark data ingestion into NeuralNav + +## 5. Feasibility of Unification + +### Codebase compatibility + +| Factor | Assessment | +| --- | --- | +| Language | Both Python -- compatible | +| Python version | config_explorer >=3.11, NeuralNav >=3.10 -- compatible | +| Package management | setuptools/pip vs uv -- minor friction, uv can install pip packages | +| Data validation | Both Pydantic >=2 -- compatible | +| UI framework | Both Streamlit -- compatible | +| Backend framework | config_explorer has no REST API; NeuralNav uses FastAPI | +| Key external dependency | `llm-optimizer` (BentoML, git-only, no PyPI) adds supply chain risk | + +### Refactoring effort + +| Integration type | Effort | Key work | +| --- | --- | --- | +| Offline synthetic benchmark generator | Low | Script to run GPURecommender batch, format output as NeuralNav JSON | +| Use config_explorer as pip dependency | Low-Medium | Add dependency, write adapter for data model translation | +| Embed capacity planning into NeuralNav | Medium | Wrap `capacity_planner` behind service interface, add HF dependency | +| Extend NeuralNav schemas for P/D disagg | Medium-High | Schema changes to GPUConfig, BenchmarkData, UI, PostgreSQL | +| Full merge of codebases | Very High | Reconcile data models, UIs, CLIs, dependencies, governance | + +### Risk areas + +1. **BentoML `llm-optimizer` dependency**: External git dependency with no PyPI release. Brings scipy, transformers, huggingface_hub. 
Supply chain and build complexity risk -- especially for offline generation, this may be acceptable since it only runs in dev/CI, not production. +2. **Synthetic vs empirical accuracy**: Roofline estimates are approximations. If mixed with empirical data, users must clearly understand confidence differences. The `estimated` flag in NeuralNav's `benchmark_metrics` exists for this purpose. +3. **HuggingFace API dependency**: config_explorer fetches model configs at runtime. For offline generation this is acceptable; for runtime integration it adds latency and failure modes. +4. **Ownership divergence**: config_explorer lives in `llm-d/llm-d-benchmark`. API changes could break downstream integrations. Offline generation is less sensitive to this since the script can pin versions. +5. **Full merge governance**: Combining codebases requires agreeing on common objectives, shared roadmap, unified release process, and operating as a single OSS team. This is an organizational decision, not just a technical one. + +### Long-term maintainability + +- **Offline generation** (Option A): Lowest burden. config_explorer is a dev tool, not a runtime dependency. NeuralNav consumes its output (JSON files), not its code. +- **Loose coupling** (Option B): Low burden. Adapter layer absorbs API changes. +- **Full merge** (Option C): Highest burden. Requires sustained alignment between teams on priorities, release cadence, and architecture direction. Only justified if both projects share a single roadmap. + +## 6. Proposed Integration Models + +### Option A: Offline Synthetic Benchmark Generation (Recommended) + +Use config_explorer as an **offline dev/CI tool** to batch-generate synthetic performance estimates in NeuralNav's benchmark data format. No runtime dependency. + +**Implementation**: + +- Create a standalone script (e.g., `scripts/generate_synthetic_benchmarks.py`) that: + 1. Takes a list of (model_id, gpu_type, gpu_count) combos from NeuralNav's `model_catalog.json` + 2. 
For each combo, calls `GPURecommender` with the 4 GuideLLM traffic profiles (512->256, 1024->1024, 4096->512, 10240->1536) + 3. Translates `PerformanceEstimationResult` -> NeuralNav `benchmarks_BLIS.json` entry format + 4. Generates synthetic percentiles from mean values (p90 ~ mean x 1.05, p95 ~ mean x 1.10, p99 ~ mean x 1.20 -- or use distribution models) + 5. Writes output as JSON with `estimated: true` flag + 6. Loads into PostgreSQL alongside real benchmarks +- NeuralNav's existing `ConfigFinder` processes synthetic benchmarks identically to real ones -- the `estimated` flag is already in the schema +- UI shows confidence indication (e.g., "Estimated" vs "Benchmarked") on recommendation cards + +**Pros**: + +- Zero runtime dependency on config_explorer or llm-optimizer +- NeuralNav's production code doesn't change (except optional UI confidence indicator) +- Can run in CI to regenerate estimates when model catalog changes +- Clear separation: offline tool generates data, NeuralNav consumes data +- config_explorer API changes only affect the generation script, not NeuralNav core +- Expands recommendation space to any HuggingFace model, not just benchmarked ones + +**Cons**: + +- Synthetic percentile generation is heuristic -- no real distribution data +- Must rerun script when adding models or GPUs +- BentoML roofline accuracy is unvalidated against NeuralNav's GuideLLM real benchmarks +- Doesn't help with P/D disaggregation (GPURecommender doesn't model disaggregation) + +**Complexity**: Low + +**Risk**: Low (isolated to offline tooling) + +### Option B: Loose Coupling (Runtime Library Integration) + +NeuralNav imports config_explorer as a pip dependency, calls it through a thin adapter as a fallback when empirical benchmarks are unavailable. 
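A minimal sketch of that fallback path, under the assumption of a thin adapter boundary (all names below are hypothetical; neither codebase defines these interfaces today):

```python
from typing import Optional, Protocol

class BenchmarkStore(Protocol):
    """Lookup into NeuralNav's empirical benchmark table (hypothetical interface)."""
    def find(self, model: str, gpu: str, tp: int) -> Optional[dict]: ...

class RooflineAdapter(Protocol):
    """Thin wrapper over config_explorer's synthetic estimation (hypothetical interface)."""
    def estimate(self, model: str, gpu: str, tp: int) -> dict: ...

def lookup_performance(model: str, gpu: str, tp: int,
                       store: BenchmarkStore, adapter: RooflineAdapter) -> dict:
    """Prefer empirical data; fall back to a synthetic estimate flagged as such."""
    row = store.find(model, gpu, tp)
    if row is not None:
        return {**row, "estimated": False}
    # Fallback: synthetic roofline estimate, clearly flagged so downstream
    # ranking and UI confidence indicators can treat it differently.
    return {**adapter.estimate(model, gpu, tp), "estimated": True}
```

Keeping the translation behind one adapter module contains the data model impedance mismatch and confines any config_explorer API churn to a single file.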
+ +**Implementation**: + +- Add `config_explorer` to NeuralNav's `pyproject.toml` +- Create `src/neuralnav/adapters/config_explorer_adapter.py` +- Modify `ConfigFinder.plan_all_capacities()` to call adapter for unbenchmarked combos +- Use `capacity_planner` for memory validation + +**Pros**: + +- Real-time synthetic estimation for any model +- Memory validation before recommendation +- Progressive enhancement: empirical data preferred, synthetic as fallback + +**Cons**: + +- Brings `llm-optimizer` + `transformers` + `huggingface_hub` into NeuralNav's dependency tree at runtime +- Runtime HuggingFace API calls add latency and failure modes +- Data model impedance mismatch +- No control over config_explorer API stability + +**Complexity**: Medium + +**Risk**: Medium + +### Option C: Full Merge + +Merge both projects into a single codebase with unified data models, shared UI, and common governance. + +**Implementation**: + +- Agree on shared project objectives and roadmap +- Form a single OSS team with shared ownership +- Reconcile all data models (GPUConfig, BenchmarkData to support P/D disaggregation) +- Merge UI pages (NeuralNav's recommendation flow + config_explorer's capacity planning + sweep visualization) +- Unified CLI and REST API +- Single dependency tree + +**Pros**: + +- Tightest possible integration +- Unified data model supporting both aggregated and disaggregated deployments +- Single user experience: from capacity analysis -> recommendation -> deployment +- No adapter overhead or format translation +- Shared roadmap eliminates divergence + +**Cons**: + +- Requires organizational alignment: common objectives, shared governance, unified release process +- Very high engineering effort to reconcile data models, UIs, and dependencies +- Risk of scope bloat -- combining a focused tool with a comprehensive system +- BentoML `llm-optimizer` becomes a core dependency of the merged project +- Forks from upstream config_explorer -- future upstream improvements 
must be manually ported or the merge must be the new upstream + +**Complexity**: Very High + +**Risk**: High (organizational + technical) + +## 7. Proof-of-Concept Plan + +### Goal + +Validate that config_explorer's synthetic roofline estimates are accurate enough to usefully expand NeuralNav's recommendation space, using **Option A (offline generation)** as the approach. + +### Success Criteria + +1. **Accuracy**: For (model, GPU, traffic_profile) combos where NeuralNav has real GuideLLM benchmarks, the roofline estimates agree within 30% on TTFT, ITL, and E2E latency for the majority (>60%) of tested combos. +2. **Coverage expansion**: The synthetic generator produces valid estimates for at least 3 model+GPU combos that NeuralNav currently cannot recommend. +3. **End-to-end flow**: Synthetic benchmarks loaded into NeuralNav are scored, ranked, and displayed with an "Estimated" indicator -- no existing recommendation quality degraded. + +### Implementation Steps + +#### Step 1: Accuracy validation (standalone script, no NeuralNav changes) + +- Script reads NeuralNav's `benchmarks_BLIS.json` +- For each (model, GPU, input_tokens, output_tokens) combo, calls `GPURecommender` +- Compares synthetic TTFT/ITL/E2E with empirical values +- Reports: correlation, mean absolute error, % within 30% +- **Output**: Accuracy report determining whether Step 2 is worth pursuing + +#### Step 2: Offline synthetic benchmark generator + +- Script iterates over all (model, GPU) combos from `model_catalog.json` +- For each of the 4 GuideLLM traffic profiles, runs `GPURecommender` +- Formats output as `benchmarks_BLIS.json` entries with `estimated: true` +- Writes to `data/benchmarks/performance/benchmarks_estimated.json` + +#### Step 3: Load and verify + +- Load synthetic benchmarks into PostgreSQL alongside real benchmarks +- Run NeuralNav's recommendation endpoint with a use case that benefits from expanded coverage +- Verify synthetic results appear in ranked recommendations with 
correct scoring +- Verify existing real-benchmark recommendations are unchanged + +#### Step 4: UI confidence indicator (optional) + +- Add "Estimated" vs "Benchmarked" badge to recommendation cards based on `benchmark_metrics.estimated` flag + +### Failure Indicators + +- Roofline estimates diverge >50% from real benchmarks for majority of combos -> synthetic data not trustworthy +- `llm-optimizer` cannot be installed alongside NeuralNav's dependencies -> offline generation requires isolated virtualenv (adds friction but not a blocker) +- GPURecommender fails for models in NeuralNav's catalog (gated HF models, unsupported architectures) -> limited coverage expansion + +### Decision Framework + +| POC Outcome | Recommendation | +| --- | --- | +| Accuracy <30% error for >60% of combos | Proceed with Option A (offline generation) | +| Accuracy 30-50% error | Proceed with caveats -- label estimates as "rough estimates" in UI | +| Accuracy >50% error | Do not integrate -- synthetic data not useful for NeuralNav's ranking | +| Independent of accuracy | Use config_explorer's P/D disaggregation data model as reference for NeuralNav's llm-d support roadmap | From a2f1e11b339ad16d26a64b59f58f20a9822151b0 Mon Sep 17 00:00:00 2001 From: Andre Fredette Date: Wed, 18 Feb 2026 17:02:49 -0500 Subject: [PATCH 2/3] docs: add proposal for unifying config_explorer and NeuralNav Makes the case for merging both projects into a single upstream tool in the llm-d organization. Covers strategic rationale, unified architecture, phased integration roadmap, and risk analysis with mitigations. 
Signed-off-by: Andre Fredette --- docs/CE_NN_INTEGRATION_PROPOSAL.md | 103 +++++++++++++++++++++++++++++ 1 file changed, 103 insertions(+) create mode 100644 docs/CE_NN_INTEGRATION_PROPOSAL.md diff --git a/docs/CE_NN_INTEGRATION_PROPOSAL.md b/docs/CE_NN_INTEGRATION_PROPOSAL.md new file mode 100644 index 0000000..a8d2558 --- /dev/null +++ b/docs/CE_NN_INTEGRATION_PROPOSAL.md @@ -0,0 +1,103 @@ +# Proposal: Unifying config_explorer and NeuralNav into a Single Upstream Project + +## The Problem + +Deploying LLMs in production is a multi-step decision process. Users start with a general use case -- a chatbot, a code assistant, a document summarization service -- and need to work through a series of decisions: + +1. **Determine requirements**: What latency, throughput, and quality targets does this use case demand? What traffic patterns should the deployment handle? +2. **Evaluate options**: Which combinations of models and hardware can meet those requirements? What are the trade-offs in cost, accuracy, latency, and deployment complexity across viable configurations? +3. **Make a decision and deploy**: Choose a configuration, generate deployment manifests, apply them to a cluster, and verify the service meets expectations. + +**NeuralNav** guides users through this entire process: describe a use case in natural language, and NeuralNav extracts the requirements, evaluates model+GPU configurations against real benchmark data, ranks options across multiple dimensions (cost, quality, latency, complexity), and deploys the chosen configuration to Kubernetes. The limitation is **coverage** -- NeuralNav can only evaluate configurations that have been empirically benchmarked, and currently only supports aggregated vLLM deployments. + +**config_explorer** (in llm-d-benchmark) expands what can be evaluated. 
Its GPU memory analysis determines hardware feasibility for virtually any model, its roofline model produces synthetic performance estimates for unbenchmarked configurations, and its benchmark explorer analyzes llm-d deployments including P/D disaggregation. It provides the breadth that NeuralNav's benchmark-dependent approach lacks, but without the guided workflow, multi-criteria ranking, or deployment automation. + +This proposal argues that combining these projects into a single upstream tool in the llm-d organization would produce a system that is meaningfully better than either project alone. + +## The Case for Integration + +### 1. NeuralNav's guided workflow + config_explorer's expanded reach + +NeuralNav already provides a complete end-to-end workflow: a user describes their use case in natural language, NeuralNav extracts the important details (model requirements, traffic profile, SLO targets), produces ranked recommendations backed by real benchmark data, generates deployment manifests, and deploys to Kubernetes. That guided approach is the core value -- users don't need to know what GPU memory their model requires or what tensor parallelism strategy to use. + +The limitation is **coverage**. NeuralNav can only recommend configurations that exist in its benchmark database -- specific combinations of models, GPUs, GPU counts, and traffic profiles that have been tested with GuideLLM. If a promising model hasn't been benchmarked on a particular GPU, it doesn't appear as an option. If the user's workload has different token length distributions than the four GuideLLM profiles, there's no data to draw from. + +config_explorer removes that ceiling. Its roofline model can estimate performance for virtually any model on any GPU, and its capacity planner can verify memory feasibility before anything is benchmarked. 
In a unified system: + +- NeuralNav's intent extraction produces the structured specification (use case, token lengths, SLO targets) that would otherwise require the user to manually figure out and plug into config_explorer's CLI +- config_explorer's engines expand the recommendation space beyond what's been empirically tested +- The system presents **validated options** from real benchmarks alongside **promising alternatives** that include potentially better models, different GPU configurations, or different data patterns that haven't been benchmarked yet +- Users get a guided experience with broader coverage, rather than having to choose between a guided tool with limited options or a manual tool with unlimited options + +### 2. The capabilities are complementary, not competitive + +The two projects have remarkably little overlap in their core logic: + +| Capability | config_explorer | NeuralNav | Overlap? | +| --- | --- | --- | --- | +| GPU memory estimation (model weights, KV cache, activation) | Yes | No | None | +| Attention-type-aware analysis (MHA/GQA/MQA/MLA) | Yes | No | None | +| Synthetic performance estimation (roofline model) | Yes | No | None | +| P/D disaggregation data model | Yes | No | None | +| llm-d scheduler parameter analysis | Yes | No | None | +| Benchmark sweep visualization (Pareto fronts) | Yes | No | None | +| Natural language intent extraction | No | Yes | None | +| Use case to SLO mapping (9 use cases, 4 traffic profiles) | No | Yes | None | +| Model quality scoring | No | Yes | None | +| Multi-criteria ranking (4 dimensions, 5 ranked views) | No | Yes | None | +| Replica and scaling calculations | No | Yes | None | +| YAML generation (KServe/vLLM/HPA/ServiceMonitor) | No | Yes | None | +| Kubernetes deployment and lifecycle management | No | Yes | None | +| SLO compliance filtering | Yes | Yes | Shared concept, different implementation | +| GPU cost lookups | Yes | Yes | Shared concept, different data sources | + +The overlap is 
limited to two shared concepts (SLO filtering and GPU pricing), both of which would benefit from a single source of truth rather than competing implementations.

### 3. NeuralNav's multi-factor decision matrix becomes the unified ranking layer

config_explorer can tell you which GPU runs a model and what performance to expect, but it has no concept of whether the model is any good at the task you need it for. It ranks GPUs, not deployment decisions.

NeuralNav factors **model accuracy and quality** benchmarks into its recommendation scoring. A model that's fast and cheap but produces low-quality output for your use case is ranked accordingly -- it won't be recommended as "best" just because it fits on a single GPU.

This matters for a unified system because config_explorer's roofline model will surface many more (model, GPU) configurations than are currently available. Without quality-aware ranking, users would drown in options with no way to distinguish a capable model from a fast-but-mediocre one. NeuralNav's scoring engine ensures that expanded coverage doesn't come at the cost of recommendation quality.

The decision matrix is also designed to grow. Beyond accuracy and latency, NeuralNav plans to incorporate **model safety scoring** (toxicity, bias, alignment characteristics) and potentially other factors like license compliance and ecosystem maturity. These dimensions apply equally to benchmarked and synthetically estimated configurations, making the unified ranking more valuable as the decision matrix expands.

### 4. Synthetic + empirical performance data is more powerful than either alone

NeuralNav's biggest limitation today is that it can only recommend (model, GPU) configurations that have been empirically benchmarked. If a model hasn't been tested on a specific GPU with GuideLLM, it simply doesn't appear as an option. This creates a cold-start problem: every new model or GPU requires a benchmark run before it can be recommended.
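One plausible way a unified recommender could close that gap is to fall back to synthetic estimates when no empirical record exists, tagging each candidate with its data provenance. A rough sketch -- every name and field here is hypothetical, not NeuralNav's or config_explorer's actual API:

```python
# Hypothetical merge of empirical and synthetic performance records.
# All identifiers are illustrative; neither project exposes this API today.
from dataclasses import dataclass

@dataclass
class Candidate:
    model: str
    gpu: str
    throughput_tok_s: float
    ttft_ms: float
    source: str  # "benchmark" (measured) or "roofline" (synthetic estimate)

CONFIDENCE = {"benchmark": 0, "roofline": 1}  # lower rank sorts first

def recommend(candidates: list[Candidate], ttft_slo_ms: float) -> list[Candidate]:
    """Filter on the TTFT SLO, then rank: empirical data first,
    higher throughput within each confidence tier."""
    ok = [c for c in candidates if c.ttft_ms <= ttft_slo_ms]
    return sorted(ok, key=lambda c: (CONFIDENCE[c.source], -c.throughput_tok_s))

pool = [
    Candidate("llama-3-8b", "H100", 4200.0, 180.0, "benchmark"),
    Candidate("llama-3-8b", "L40S", 1900.0, 450.0, "roofline"),
    Candidate("mixtral-8x7b", "H100", 2600.0, 900.0, "roofline"),  # misses SLO
]
for c in recommend(pool, ttft_slo_ms=500.0):
    print(c.model, c.gpu, c.source)
```

The point of the provenance tag is that a synthetic row never silently outranks a measured one, no matter how optimistic the estimate; it can only widen the option space below the empirical tier.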

config_explorer's roofline model can produce synthetic estimates for *any* model on *any* GPU, instantly. These estimates are less accurate than real benchmarks but far better than no data at all.

A unified system can present recommendations with **tiered confidence**:

- **High confidence**: Configurations backed by real GuideLLM benchmark data
- **Medium confidence**: Configurations with synthetic roofline estimates, validated by memory feasibility analysis
- **Exploratory**: Configurations from capacity planning alone (feasible but uncharacterized performance)

This gives users a complete picture: proven options alongside promising alternatives worth benchmarking.

### 5. The benchmark feedback loop becomes automatic

Today, benchmark data flows in one direction: llm-d-benchmark produces it, and NeuralNav (or similar tools) consume it. In a unified project:

1. **Benchmarks feed recommendations**: Real benchmark data from GuideLLM runs populates the recommendation database
2. **Recommendations drive benchmarks**: When the system identifies high-value (model, GPU) combinations that lack empirical data (only synthetic estimates), it can prioritize those for the next benchmark run
3. **Deployments validate both**: Actual deployment outcomes feed back to calibrate both synthetic estimates and benchmark-based predictions

This closed loop improves recommendation quality over time without manual coordination between separate projects.

### 6. NeuralNav needs P/D disaggregation support

NeuralNav currently supports only aggregated vLLM deployments.
Adding llm-d P/D disaggregation support is a stated goal but would require significant schema work: + +- Extending `GPUConfig` to represent separate prefill/decode configurations +- Extending `BenchmarkData` to store disaggregated metrics +- Extending the recommendation engine to compare aggregated vs disaggregated topologies +- Building UI for disaggregated deployment visualization + +config_explorer's `explorer.py` already has all of this modeled in its 84-column DataFrame schema (`Is_PD`, `P_TP`, `D_TP`, `P_Replicas`, `D_Replicas`, scheduler weights). In a unified project, this becomes the foundation rather than something NeuralNav has to reinvent. + +## Conclusion + +The strongest argument for integration is not that it reduces engineering effort -- it doesn't in the short term. The argument is that **the unified product is meaningfully better than either project alone**. A tool that can analyze GPU memory feasibility, estimate performance synthetically, rank configurations against real benchmarks with quality and safety awareness, generate deployment manifests, and deploy to Kubernetes -- all in one workflow -- is qualitatively different from two tools that each cover part of that path. As the decision matrix grows to include safety, license compliance, and other factors, the value of a single ranking engine that applies these dimensions to both benchmarked and synthetically estimated configurations compounds. From d8f55f3e1f75edb9ab0cc4d23b6ecc353dbdd9c1 Mon Sep 17 00:00:00 2001 From: Andre Fredette Date: Tue, 3 Mar 2026 18:24:00 -0500 Subject: [PATCH 3/3] docs: add draft GitHub issue for SIG Benchmarking collaboration proposal Draft issue proposing contributing NeuralNav to the llm-d SIG Benchmarking ecosystem and enabling collaboration with Config Explorer. 
Signed-off-by: Andre Fredette --- docs/SIG_BENCHMARKING_ISSUE.md | 76 ++++++++++++++++++++++++++++++++++ 1 file changed, 76 insertions(+) create mode 100644 docs/SIG_BENCHMARKING_ISSUE.md diff --git a/docs/SIG_BENCHMARKING_ISSUE.md b/docs/SIG_BENCHMARKING_ISSUE.md new file mode 100644 index 0000000..b72ece2 --- /dev/null +++ b/docs/SIG_BENCHMARKING_ISSUE.md @@ -0,0 +1,76 @@ +# Proposal: Contribute NeuralNav to SIG Benchmarking and integrate with Config Explorer + +## Summary + +Contributors from the [Config Explorer](https://github.com/llm-d/llm-d-benchmark/tree/main/config_explorer) and [NeuralNav](https://github.com/redhat-et/neuralnav) teams have been collaborating and have identified strong technical synergies between the two projects. We'd like to propose contributing NeuralNav to the SIG Benchmarking ecosystem and working toward integration with Config Explorer. + +This is a joint collaboration proposal. Both projects are fully functional, independently valuable applications today. We envision them starting as independent peers within a shared repository and progressively integrating over time -- sharing components, aligning data formats, and eventually converging into a unified planning tool for llm-d deployments. A detailed proposal with integration architecture and roadmap is available in [Proposal: An llm-d-planner for rapid llm-d configuration planning](https://docs.google.com/document/d/1jnsWjjjxyVr1SjVVaZz5i378KHS6xx6HCNqHAPo3q8w/edit?usp=sharing). + +## Background + +**Config Explorer** helps users analyze GPU memory requirements, estimate performance using BentoML's roofline model, and explore llm-d benchmark results -- including P/D disaggregation configurations and scheduler parameter analysis. It answers: *"Can this model run on this GPU, and what performance can I expect?"* + +**NeuralNav** guides users from business requirements to production deployments. 
Users describe their use case in natural language, and NeuralNav extracts requirements, maps them to SLO targets and traffic profiles, queries benchmark data, ranks model+GPU configurations across multiple dimensions (accuracy, cost, latency, complexity), generates Kubernetes manifests, and deploys. It answers: *"Given my business needs, what should I deploy and how do I get it running?"* + +Both projects are working applications with Streamlit frontends, CLIs, and Python library APIs. Both help users interpret benchmark data and plan deployments -- but from different angles and with different capabilities. + +## Why the projects complement each other + +We've done a detailed integration analysis and found that the two projects have remarkably little overlap in their core logic: + +**Config Explorer provides capabilities NeuralNav lacks:** +- GPU memory estimation (weights, KV cache, activation) with attention-type awareness (MHA/GQA/MQA/MLA) +- Synthetic performance estimation via roofline model for any model on any GPU +- P/D disaggregation data model and llm-d scheduler parameter analysis +- Benchmark sweep visualization with Pareto-front analysis + +**NeuralNav provides capabilities Config Explorer lacks:** +- Natural language intent extraction (use case description to structured requirements) +- Use case to SLO mapping +- Model quality scoring +- Multi-criteria ranking across 4 dimensions with 5 ranked views +- YAML generation (KServe/vLLM/HPA/ServiceMonitor) +- Replica and scaling calculations + +The overlap is limited to two shared concepts -- SLO compliance filtering and GPU cost lookups -- both of which would benefit from a single source of truth. + +A few specific integration opportunities stand out: + +1. **Expanded coverage**: NeuralNav can only recommend configurations that have been empirically benchmarked. Config Explorer's roofline model could fill gaps with synthetic estimates, presented alongside real benchmark data with tiered confidence levels. 
+ +2. **Quality-aware ranking at scale**: Config Explorer surfaces many (model, GPU) configurations but doesn't evaluate whether the model is good at the user's task. NeuralNav's multi-criteria scoring ensures expanded coverage doesn't come at the cost of recommendation quality. + +3. **P/D disaggregation**: NeuralNav currently supports only aggregated vLLM deployments. Config Explorer already models P/D disaggregation natively in its data schema -- this could become the foundation for llm-d-native deployment guidance rather than something that needs to be built from scratch. + +4. **Benchmark feedback loop**: NeuralNav's recommendations could identify high-value (model, GPU) combinations that lack empirical data, helping prioritize future benchmark runs. + +## Proposed collaboration + +We'd like to propose the following: + +1. **Contribute NeuralNav** to the SIG Benchmarking ecosystem as a complete working project. + +2. **Enable the two projects to leverage each other's technology** over time, starting with the areas of strongest synergy (benchmark data access, synthetic performance estimation, capacity planning). + +3. **Gradually extract shared components** as collaboration matures -- for example, benchmark data formats and loading, hardware/model specification logic, performance estimation, and visualization utilities. + +Initially, both projects would continue to function as independent applications -- contributing NeuralNav doesn't break existing functionality. Over time, as shared components are extracted and integration deepens, the two projects would progressively converge into a unified tool. + +## Proposed structure + +We propose to **create a new repository** (name TBD) under SIG Benchmarking to host both Config Explorer and NeuralNav. The reasoning: + +- **Keep llm-d-benchmark focused** on benchmarking infrastructure and benchmark results -- what it does well today. 
+- **Give applications their own home** -- Config Explorer and NeuralNav are both user-facing applications that help people *use* benchmark data. A dedicated repository would be a natural home for tools in this category. +- **Enable independent evolution** -- both projects can continue developing while integration work proceeds alongside them. +- **Facilitate shared component extraction** -- as common patterns emerge (data access, hardware specs, cost calculations), they can be factored out within the same repository. + +## Next steps / discussion + +We'd welcome feedback from SIG maintainers and the broader community on: + +- **Interest**: Is this kind of collaboration something the SIG would like to pursue? +- **Repository structure**: New repo vs. hosting within llm-d-benchmark -- what makes sense for the project? +- **Integration priorities**: Which integration points are most valuable to the community? + +We'd be happy to demo NeuralNav at an upcoming SIG meeting and answer any questions about the proposal.