
docs: add Config Explorer / NeuralNav collaboration analysis and SIG proposal #104

Closed
anfredette wants to merge 217 commits into llm-d-incubation:main from anfredette:config-exlorer-eval

Conversation


anfredette (Collaborator) commented Mar 3, 2026

Docs Include

  • Comparative technical evaluation of config_explorer and NeuralNav
  • Integration proposal analyzing complementary capabilities and synergy opportunities
  • Draft GitHub issue for proposing collaboration to SIG Benchmarking

Context

These documents support an upcoming proposal to the llm-d SIG Benchmarking
community to contribute NeuralNav and explore integration with Config Explorer.

NOTES:

  • This PR is created for discussion purposes and may or may not get merged into the repo.
  • CONFIG_EXPLORER_EVALUATION.md was produced almost entirely by Claude Code for my own info. I didn't spend any time trying to make it suitable for public consumption, and it will likely not be merged.

Summary by CodeRabbit

  • Documentation
    • Added integration proposal documentation detailing project combination strategy.
    • Added comparative technical evaluation documentation with integration models and feasibility assessment.
    • Added benchmarking ecosystem contribution proposal with collaboration structure.

yuvalluria and others added 30 commits December 18, 2025 15:46
- Add usecase_quality_scorer.py: Loads use-case specific scores from Artificial Analysis
- Update model_evaluator.py: Uses quality scorer (50pts), keeps Andre's latency/budget
- Update solution_scorer.py: score_accuracy() uses quality data, falls back to size
- Add weighted_scores CSVs for 9 use cases
- Add USE_CASE_METHODOLOGY.md, research data, BLIS benchmarks

Integration: Quality (Yuval) + Latency/Throughput (Andre)
Signed-off-by: Yuval Luria <yuval750@gmail.com>
Signed-off-by: Yuval Luria <yuval750@gmail.com>
- Add PRIORITY_ADJUSTMENTS for SLO tuning (low_latency, balanced, cost_saving, high_throughput)
- Add USECASE_SLO_WORKLOAD configuration loading from JSON
- Add apply_priority_adjustment() function for dynamic SLO adjustment
- Streamlined 3-tab interface (Chat, Recommendation Details, Deployment Management)

This integrates Yuval's UI improvements with research-based SLO adjustments
based on Nielsen UX (1993), SCORPIO, vLLM, and GitHub Copilot studies.

Signed-off-by: Yuval Luria <yuval750@gmail.com>
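The commit above describes a PRIORITY_ADJUSTMENTS table and an apply_priority_adjustment() function that scale use-case SLO targets by the selected priority. A minimal sketch of that pattern, assuming illustrative multiplier values and metric names (the real table and its values live in the repo's JSON configs):

```python
# Hypothetical sketch of priority-based SLO adjustment. The multiplier
# values and metric keys are illustrative, not the project's actual data.
PRIORITY_ADJUSTMENTS = {
    "low_latency":     {"ttft_p95_ms": 0.7, "e2e_p95_ms": 0.7},
    "balanced":        {"ttft_p95_ms": 1.0, "e2e_p95_ms": 1.0},
    "cost_saving":     {"ttft_p95_ms": 1.5, "e2e_p95_ms": 1.5},
    "high_throughput": {"ttft_p95_ms": 1.2, "e2e_p95_ms": 1.2},
}

def apply_priority_adjustment(slo_targets: dict, priority: str) -> dict:
    """Scale each baseline SLO target by the multiplier for the chosen priority.

    Metrics with no multiplier, and unknown priorities, pass through unchanged.
    """
    factors = PRIORITY_ADJUSTMENTS.get(priority, {})
    return {k: v * factors.get(k, 1.0) for k, v in slo_targets.items()}
```

With this shape, "low_latency" tightens the latency targets (e.g. a 1000 ms TTFT target becomes 700 ms) while "cost_saving" relaxes them, which matches the intent described in the commit.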
- usecase_slo_workload.json: Research-based SLO targets per use case
- usecase_weights.json: Benchmark weights per use case
- all_usecases_config.json: Complete use case configurations

Required by Test_AA UI for PRIORITY_ADJUSTMENTS functionality.

Signed-off-by: Yuval Luria <yuval750@gmail.com>
- Add TASK DATASETS section showing use-case benchmark weights
- Add Technical Spec display for optional fields
- Add Real BLIS Benchmark SLOs section
- Add Research-Based SLO TARGETS vs BLIS Actual comparison
- Add 206 model catalog browser
- Add editable SLO Targets and Workload Profile
- Full integration with Andre's BLIS benchmark data

This is the complete POC UI with all features from Test_AA.

Signed-off-by: Yuval Luria <yuval750@gmail.com>
Only recommend H100 GPUs from real BLIS benchmark data.
A100-80 was showing as default but is not the primary benchmark hardware.

Signed-off-by: Yuval Luria <yuval750@gmail.com>
- Convert multiline f-string HTML templates to single-line strings
- This fixes raw HTML showing as text in Streamlit
- Affects render_slo_cards(), render_impact_factors(), render_impact_factors_mini()

Signed-off-by: Yuval Luria <yuval750@gmail.com>
MAJOR REFACTOR: New recommendation engine using real BLIS benchmarks

Changes:
- Add blis_recommendation() function with ACTUAL BLIS benchmark data
- Model+Hardware combinations scored together (41 unique combos)
- Restore A100-80 hardware (both H100 and A100-80 are real BLIS data)
- Quality scores from use-case CSVs (weighted_scores/)
- Latency/throughput from ACTUAL BLIS (ttft_p95, e2e_p95, tokens_per_second)
- Cost from hardware tier (cheaper = higher score)

Priority-based ranking:
- cost_saving: cheapest hardware meeting SLO for best models
- low_latency: fastest TTFT from BLIS data
- high_quality: best model quality with SLO-meeting hardware
- high_throughput: highest tokens/second from BLIS data
- balanced: weighted MCDM combination

Returns ranked list of Model+Hardware combinations with:
- Actual BLIS metrics (not interpolated)
- SLO compliance check
- Score breakdown (quality, latency, cost, throughput)

Signed-off-by: Yuval Luria <yuval750@gmail.com>
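For the "balanced" priority, the commit describes a weighted MCDM combination over Model+Hardware combos. A self-contained sketch of that ranking step, with hypothetical weights, sub-score values, and field names (real quality scores come from the AA CSVs and latency/throughput from BLIS):

```python
# Illustrative "balanced" ranking: weighted sum of normalized (0-1)
# sub-scores per model+hardware combo. Weights and data are hypothetical.
BALANCED_WEIGHTS = {"quality": 0.4, "latency": 0.25, "cost": 0.2, "throughput": 0.15}

def balanced_score(combo: dict) -> float:
    """Weighted composite of a combo's sub-scores."""
    return sum(w * combo[k] for k, w in BALANCED_WEIGHTS.items())

def rank_combos(combos: list[dict]) -> list[dict]:
    """Sort SLO-compliant combos best-first by balanced score."""
    eligible = [c for c in combos if c.get("meets_slo", True)]
    return sorted(eligible, key=balanced_score, reverse=True)

combos = [
    {"name": "Llama 3.1 8B on H100", "quality": 0.70, "latency": 0.90,
     "cost": 0.50, "throughput": 0.90, "meets_slo": True},
    {"name": "Llama 3.1 70B on H100 x4", "quality": 0.95, "latency": 0.60,
     "cost": 0.20, "throughput": 0.50, "meets_slo": True},
    {"name": "Llama 3.1 70B on A100-80", "quality": 0.95, "latency": 0.30,
     "cost": 0.40, "throughput": 0.30, "meets_slo": False},
]
```

The other priorities (cost_saving, low_latency, high_quality, high_throughput) would simply sort by a single sub-score or BLIS metric instead of the composite.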
Signed-off-by: Yuval Luria <yuval750@gmail.com>
- Fix blis_recommendation() to output 'recommendations' (not 'all_recommendations')
- Include hardware in model name (e.g., 'Llama 3.1 8B on H100 x2')
- Add blis_metrics to each recommendation (ttft_p95_ms, e2e_p95_ms, tokens_per_second)
- Add get_model_pros() and get_model_cons() based on ACTUAL BLIS data
- Score breakdown includes quality, latency, cost, capacity contributions

Now recommendations show:
- Model + Hardware combination
- ACTUAL BLIS latency metrics
- Hardware cost
- SLO compliance status

Signed-off-by: Yuval Luria <yuval750@gmail.com>
BUG: Code looked for data/benchmarks/benchmarks_BLIS.json
FIX: File is at data/benchmarks_BLIS.json

This was causing fallback to mock_recommendation() which shows
models without hardware configurations (gpt-oss-20B, Apriel, etc.)

Now BLIS data loads correctly and shows REAL Model+Hardware combos

Signed-off-by: Yuval Luria <yuval750@gmail.com>
- Fixed path: data/benchmarks_BLIS.json (was data/benchmarks/benchmarks_BLIS.json)
- Now recommendations show Model+Hardware combinations with ACTUAL BLIS metrics
- Quality from AA CSVs + Latency/Throughput from BLIS = Complete recommendation

Signed-off-by: Yuval Luria <yuval750@gmail.com>
Signed-off-by: Yuval Luria <yuval750@gmail.com>
Integrate Yuval's quality scoring with Andre's latency benchmarks

Signed-off-by: Andre Fredette <afredette@redhat.com>
- Update LLM from llama3.1:8b to qwen2.5:7b
- Expand backend structure with detailed module descriptions
- Document data directory structure (47 models, 9 use cases, BLIS benchmarks)
- Add use case configs directory documentation
- Update Critical Data Collections with BLIS as benchmark source
- Clarify PostgreSQL usage for serving benchmark data

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
…ia-anfredette

docs: Update CLAUDE.md to reflect current codebase state

Signed-off-by: Andre Fredette <afredette@redhat.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
Merge merge-yuvalluria branch into main

Signed-off-by: Andre Fredette <afredette@redhat.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
…-flow

doc: Add document describing the current recommendation flow

Signed-off-by: Andre Fredette <afredette@redhat.com>
Move the Technical Spec (Optional Fields) card from a separate column
into the same row as SLO TARGETS, WORKLOAD PROFILE, and TASK DATASETS.
All four cards now share consistent styling and layout.

Signed-off-by: Andre Fredette <afredette@redhat.com>
- Move token length fields from SLO Targets to Workload Profile column
- Rename fields: Mean Prompt Tokens, Mean Output Tokens
- Add hover tooltips explaining values are determined by use case
- Remove unused Concurrency field
- Remove BLIS Config message from SLO Targets (now shown in Workload Profile)
- Remove dead functions: render_impact_factors_mini, render_impact_factors

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
QPS exceeding single-GPU benchmarks is handled by recommending more GPUs,
not by warning the user. Removed unnecessary QPS-related warning messages.

Signed-off-by: Andre Fredette <afredette@redhat.com>
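The scaling rule this commit describes can be sketched as a simple ceiling division: when the requested QPS exceeds what one GPU sustains in the benchmark data, recommend more replicas instead of warning. Function name and signature are assumptions for illustration:

```python
import math

# Sketch of "handle high QPS by recommending more GPUs, not warning".
def replicas_for_qps(target_qps: float, single_gpu_qps: float) -> int:
    """Smallest replica count whose aggregate benchmarked QPS covers the target."""
    if single_gpu_qps <= 0:
        raise ValueError("single_gpu_qps must be positive")
    return max(1, math.ceil(target_qps / single_gpu_qps))
```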
refactor: Reorganize SLO cards layout and clean up workload UI
Replace UI-side optimal hardware recommendation with backend-driven
ranked recommendations that provide 5 sorted views:
- Balanced (weighted composite)
- Best Accuracy (model capability)
- Lowest Cost (price efficiency)
- Lowest Latency (SLO headroom)
- Simplest (deployment complexity)

Backend changes:
- Add /api/ranked-recommend-from-spec endpoint for spec-based ranking
- Add RankedRecommendationFromSpecRequest schema
- Add generate_ranked_recommendations_from_spec() workflow method

UI changes:
- Add fetch_ranked_recommendations() to call backend API
- Add render_weight_controls() with accuracy/cost/latency/simplicity weights
- Add render_ranked_recommendations() with unified table display
- Add expand/collapse toggle buttons for viewing alternatives
- Add "Re-Evaluate" button to apply configuration changes

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
…-display

feat: Add ranked recommendations with backend API integration
Restructure the contributing guide to have the complete Git workflow
in one place, from initial fork setup through submitting a PR. Previously
the steps were scattered across Getting Started, Development Workflow,
and Pull Request Guidelines sections.

Signed-off-by: Andre Fredette <afredette@redhat.com>
…uting

docs: Reorganize CONTRIBUTING.md with consolidated Git workflow
Signed-off-by: Amit Oren <amoren@redhat.com>
Signed-off-by: Amit Oren <amoren@redhat.com>
anfredette and others added 21 commits February 10, 2026 17:17
Delete UI code that duplicated backend-provided data:
- Remove get_raw_aa_accuracy(), load_weighted_scores(), and
  BENCHMARK_TO_AA_NAME_MAP — use backend scores.accuracy_score directly
- Remove calc_throughput() and USE_CASE_OUTPUT_TOKENS — use backend
  benchmark_metrics.tps_* directly
- Replace hardcoded priority_weights dict in winner dialog with actual
  user weights from session state
- Remove use_case_context and model_traits marketing copy dicts,
  replace with score-based summary

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
…eamlit

- Delete 1,259-line main CSS block, keep only deploy-button-hide rule
- Delete 6 inline CSS blocks (filter controls, SLO inputs, dialogs, etc.)
- Replace render_hero() with native st.image() + st.title() + st.caption()
- Replace render_score_bar() with st.progress() + native columns
- Replace carousel component (204 lines) with simple category cards
- Replace 11 section-header divs with st.subheader()
- Remove 3 JavaScript tab-switching blocks
- Strip all non-semantic hardcoded colors (~200 references), keeping only
  #ef4444 for errors/warnings and #10B981 for success indicators
- Remove hardcoded dark-mode backgrounds, gradients, and theme-specific borders
- Remove light-mode-only theme overrides from .streamlit/config.toml
- Fix invisible text from leftover -webkit-text-fill-color: transparent
- Fix 8 broken gradient remnants from partial CSS stripping
- Restore NeuralNav logo using st.image() in column layout

UI now relies on Streamlit's native theming for both light and dark mode.
File reduced from ~5,039 to ~3,218 lines.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
- Replace 27 individual session state init blocks with SESSION_DEFAULTS
  dict + loop
- Consolidate identical get_scores() / get_model_scores() into one
  module-level function
- Simplify dialog mutual-exclusion (Streamlit @st.dialog handles this)
- Extract render_slo_input() helper to replace 3 repeated TTFT/ITL/E2E
  input blocks
- Remove 3 dead switch_to_tab writes (JS consumers removed in Phase 4)
- Move per-metric percentile init into SESSION_DEFAULTS

File reduced from 3,218 to 3,117 lines.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
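The SESSION_DEFAULTS pattern from this commit replaces 27 per-key init blocks with one dict plus a loop. A sketch under assumed key names, using a plain dict in place of `st.session_state` so it runs without Streamlit (session_state supports the same `in` / item-assignment protocol):

```python
# Hypothetical defaults; the real keys live in ui/state.py.
SESSION_DEFAULTS = {
    "active_tab": "Chat",
    "priority": "balanced",
    "ttft_percentile": "p95",
    "selected_combo": None,
}

def init_session_state(state) -> None:
    """Set each default only if the key is not already present,
    so user-modified values survive Streamlit reruns."""
    for key, value in SESSION_DEFAULTS.items():
        if key not in state:
            state[key] = value
```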
- Split render_top5_table() into _render_filter_summary(),
  _render_category_card(), and slim orchestrator
- Split render_slo_cards() into _render_slo_targets(),
  _render_workload_profile(), _render_accuracy_benchmarks(),
  _render_priorities(), and slim orchestrator
- Move TASK_DATASETS to module level
- Add prev/next navigation to category cards (< 1 of 5 >)
  allowing users to browse and select from all top-5 per category
- Remove dead render_best_card(), unused best_* variables,
  duplicate fetch_workload_profile() call, and cascading
  unused parameters through 3 functions
- Deduplicate 3 priority rows into a loop in _render_priorities()

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
Break the 3,035-line app.py into 10 focused files with no circular
dependencies. Remove ~260 lines of dead code (render_weight_controls,
render_ranked_recommendations, DATA_DIR, build_row, nested
format_gpu_config).

New structure:
  ui/app.py (447 lines) — entry point + tab routing
  ui/helpers.py — pure utility functions
  ui/state.py — session state defaults + init
  ui/api_client.py — all backend API calls
  ui/components/dialogs.py — winner/category/full-table dialogs
  ui/components/slo.py — SLO targets, workload, priorities
  ui/components/extraction.py — extraction display/approval/edit
  ui/components/recommendations.py — category cards, top-5 table
  ui/components/deployment.py — deployment tab

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
- Simplify UI layout, compact cards, auto-tab navigation
- Reduce top whitespace, hide Deploy button via toolbarMode=viewer
- Align hero icon/title, rename buttons across workflow
- Replace HTML extraction grid with compact markdown display
- Redesign recommendation cards with Solution/Scores/Values lines
- Add circular navigation and clear selection on navigate
- Auto-switch tabs on Generate Specification / Generate Recommendations
- Remove RPS warnings and "Want Different Results?" section
- Add Priorities column headers, update label text in Technical Spec
- Clear deployment state when starting new analysis

Signed-off-by: Andre Fredette <afredette@redhat.com>
Auto-fixed import sorting (I001), unused imports (F401), extraneous
f-string prefix (F541), and Optional type annotation (UP007).
Manually fixed set comprehension (C401), combined with-statements
(SIM117), and unused variable (F841).

Signed-off-by: Andre Fredette <afredette@redhat.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
Simplify UI: remove ~3,000 lines of complexity, split into modules
Promote ARCHITECTUREv2.md to the canonical ARCHITECTURE.md and move
the original to archive/ARCHITECTUREv1.md. Update the internal
cross-reference link to match the new filename.

Signed-off-by: Andre Fredette <afredette@redhat.com>
docs: Rename ARCHITECTUREv2 to ARCHITECTURE and archive v1
Enable remote benchmark data management without shell access.
Adds three API endpoints (GET /db/status, POST /db/upload-benchmarks,
POST /db/reset) and a Configuration tab in the UI for uploading
benchmark JSON files, viewing database statistics, and resetting data.

Extracts shared loading logic into knowledge_base/loader.py, used by
the CLI script, API endpoints, and UI. Includes config_id deduplication with
requests_per_second for correct duplicate detection.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
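The deduplication rule mentioned above treats two benchmark records as duplicates only when both config_id and requests_per_second match, so the same configuration measured at different request rates is kept. A sketch assuming a flat record shape (field names follow the commit message; everything else is illustrative):

```python
# Sketch of config_id + requests_per_second duplicate detection.
def dedupe_benchmarks(records: list[dict]) -> list[dict]:
    """Keep the first record for each (config_id, requests_per_second) pair."""
    seen: set[tuple] = set()
    unique = []
    for rec in records:
        key = (rec["config_id"], rec["requests_per_second"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```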
After the PyPA src layout refactoring, BACKEND_DIR was never defined,
breaking make lint/format/test targets. Rename to SRC_DIR to match
the actual directory name and fix lint/format glob paths.

Signed-off-by: Andre Fredette <afredette@redhat.com>
feat: add database management via REST API and UI Configuration tab
…oling

Add complete Kubernetes manifests for deploying NeuralNav to OpenShift:

- Namespace, Secrets, and OpenShift Routes (TLS edge termination)
- PostgreSQL 16 with PVC-backed persistent storage
- Ollama with startup model pull (granite3.3:2b) and 20Gi PVC for models
- Backend API with environment-driven config (DATABASE_URL, OLLAMA_HOST,
  OLLAMA_MODEL) and health/readiness probes
- Streamlit UI with Streamlit-native health probes
- db-init Job that waits for PostgreSQL, initializes the schema, and
  loads all benchmark datasets (with bounded retry loop)
- deploy-all.sh orchestration script for ordered deployment via oc

Dockerfile updates for OpenShift compatibility:
- Pin platform to linux/amd64 for consistent cross-platform builds
- Run uvicorn directly via PATH instead of uv run
- Include scripts/ and README.md in image for db-init and hatchling
- Healthcheck validates HTTP status with raise_for_status()
- Restrict generated_configs/logs directory permissions to 750

OllamaClient refactored to use ollama.Client instance (required for
connecting to remote Ollama hosts in K8s) with OLLAMA_MODEL and
OLLAMA_HOST environment variable support.

Also adds Makefile docker-push target and increases the UI extract
timeout to 300s for CPU-only Ollama inference.

Co-authored-by: Andre Fredette <afredette@redhat.com>

Signed-off-by: Amit Oren <amoren@redhat.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
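The bounded retry loop in the db-init Job (wait for PostgreSQL, then load data, giving up after a fixed number of attempts rather than hanging the Job) can be sketched as below. The real Job likely does this in shell inside the manifest; attempt count, delay, and function name here are assumptions:

```python
import time

# Sketch of a bounded readiness poll, as used by the db-init Job.
def wait_until_ready(check, attempts: int = 30, delay_s: float = 2.0) -> bool:
    """Poll `check()` up to `attempts` times; return False instead of blocking forever."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay_s)
    return False
```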
Add dedicated targets to build and push the backend Docker image to
quay.io, following the same pattern as the existing simulator targets.

Signed-off-by: Amit Oren <amoren@redhat.com>
…ackend-targets

feat: add build-backend and push-backend Makefile targets
Structured technical evaluation covering functional comparison,
architectural analysis, overlap/complementarity assessment, and
integration feasibility between llm-d-benchmark's config_explorer
and NeuralNav. Proposes offline synthetic benchmark generation as
the recommended integration path.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
Makes the case for merging both projects into a single upstream
tool in the llm-d organization. Covers strategic rationale,
unified architecture, phased integration roadmap, and risk
analysis with mitigations.

Signed-off-by: Andre Fredette <afredette@redhat.com>

coderabbitai bot commented Mar 3, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ef4b740d-bb05-43c7-b497-d4464e9d35fc

📝 Walkthrough

Three new documentation files propose and evaluate the integration of config_explorer and NeuralNav projects. The documents include architectural comparisons, integration strategies, feasibility assessments, and governance models for combining the two tools into a unified upstream project or ecosystem collaboration.

Changes

Cohort / File(s): Integration Proposal & Evaluation Documentation
(docs/CE_NN_INTEGRATION_PROPOSAL.md, docs/CONFIG_EXPLORER_EVALUATION.md, docs/SIG_BENCHMARKING_ISSUE.md)
Summary: Three new proposal documents detailing the integration of config_explorer and NeuralNav. Includes technical evaluation comparing functional capabilities, data models, and APIs; integration rationale covering guided workflows, evaluation expansion, and ranking; feasibility assessment with multiple integration models (offline generation, loose coupling, full merge); risk analysis; and governance proposals for SIG Benchmarking collaboration. No code or API changes introduced.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 3 passed

  • Title check: ✅ Passed. The title accurately and concisely describes the main changes: adding documentation files that present a collaborative analysis between Config Explorer and NeuralNav, plus a proposal to the SIG Benchmarking community.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; check skipped.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.



coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/CE_NN_INTEGRATION_PROPOSAL.md (1)

36-53: Make comparative capability claims auditable.

The capability matrix uses definitive “None/Shared” overlap labels, but doesn’t point to evidence boundaries (code refs, benchmark artifacts, or date/commit scope). For a governance-facing doc, this makes technical conclusions hard to validate and maintain over time.

A compact “Evidence & scope” note under the table (sources + snapshot date) would make the claims reviewable.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."


ℹ️ Review info

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between ad7c395 and 1bfb3bb.

📒 Files selected for processing (3)
  • docs/CE_NN_INTEGRATION_PROPOSAL.md
  • docs/CONFIG_EXPLORER_EVALUATION.md
  • docs/SIG_BENCHMARKING_ISSUE.md

@@ -0,0 +1,103 @@
# Proposal: Unifying config_explorer and NeuralNav into a Single Upstream Project

⚠️ Potential issue | 🟠 Major

Align this document’s strategy with the SIG proposal narrative.

This file positions a single unified upstream project, while docs/SIG_BENCHMARKING_ISSUE.md explicitly proposes collaboration with both tools remaining independent. That strategic mismatch will confuse stakeholders during SIG review and weakens the proposal package coherence.

Consider adding a short “Positioning” section near Line 1 that clearly states whether this is:

  1. the preferred direction, or
  2. an alternative to the independent-collaboration path in the SIG issue.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 15-16, 61-67


Comment on lines +331 to +334
1. **Accuracy**: For (model, GPU, traffic_profile) combos where NeuralNav has real GuideLLM benchmarks, the roofline estimates agree within 30% on TTFT, ITL, and E2E latency for the majority (>60%) of tested combos.
2. **Coverage expansion**: The synthetic generator produces valid estimates for at least 3 model+GPU combos that NeuralNav currently cannot recommend.
3. **End-to-end flow**: Synthetic benchmarks loaded into NeuralNav are scored, ranked, and displayed with an "Estimated" indicator -- no existing recommendation quality degraded.


⚠️ Potential issue | 🟠 Major

POC acceptance thresholds are too loose for ranking-quality decisions.

Proceeding when latency estimates are within 30% for just >60% of combos (and still allowing 30–50% error with caveats) risks introducing noisy synthetic data into recommendation ranking. That can degrade trust in “best” outputs.

Please tighten go/no-go criteria (e.g., stricter error bands for p95-sensitive metrics and minimum pass rates per traffic profile), not only aggregate pass rates.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 373-375


anfredette marked this pull request as draft March 3, 2026 23:45
Draft issue proposing contributing NeuralNav to the llm-d SIG
Benchmarking ecosystem and enabling collaboration with Config Explorer.

Signed-off-by: Andre Fredette <afredette@redhat.com>
anfredette force-pushed the config-exlorer-eval branch from 1bfb3bb to d8f55f3 (March 4, 2026 23:23)
anfredette closed this Mar 26, 2026
anfredette deleted the config-exlorer-eval branch March 27, 2026 17:31
4 participants