
docs: add Config Explorer / NeuralNav collaboration analysis and SIG proposal #104

Closed
anfredette wants to merge 217 commits into llm-d-incubation:main from anfredette:config-exlorer-eval

Conversation


anfredette (Collaborator) commented Mar 3, 2026

Docs Include

  • Comparative technical evaluation of config_explorer and NeuralNav
  • Integration proposal analyzing complementary capabilities and synergy opportunities
  • Draft GitHub issue for proposing collaboration to SIG Benchmarking

Context

These documents support an upcoming proposal to the llm-d SIG Benchmarking
community to contribute NeuralNav and explore integration with Config Explorer.

NOTES:

  • This PR is created for discussion purposes and may or may not get merged into the repo.
  • CONFIG_EXPLORER_EVALUATION.md was produced almost entirely by Claude Code for my own info. I didn't spend any time trying to make it suitable for public consumption, and it will likely not be merged.

Summary by CodeRabbit

  • Documentation
    • Added integration proposal documentation detailing project combination strategy.
    • Added comparative technical evaluation documentation with integration models and feasibility assessment.
    • Added benchmarking ecosystem contribution proposal with collaboration structure.

yuvalluria and others added 30 commits December 18, 2025 15:46
- Add usecase_quality_scorer.py: Loads use-case specific scores from Artificial Analysis
- Update model_evaluator.py: Uses quality scorer (50pts), keeps Andre's latency/budget
- Update solution_scorer.py: score_accuracy() uses quality data, falls back to size
- Add weighted_scores CSVs for 9 use cases
- Add USE_CASE_METHODOLOGY.md, research data, BLIS benchmarks

Integration: Quality (Yuval) + Latency/Throughput (Andre)
Signed-off-by: Yuval Luria <yuval750@gmail.com>
Signed-off-by: Yuval Luria <yuval750@gmail.com>
- Add PRIORITY_ADJUSTMENTS for SLO tuning (low_latency, balanced, cost_saving, high_throughput)
- Add USECASE_SLO_WORKLOAD configuration loading from JSON
- Add apply_priority_adjustment() function for dynamic SLO adjustment
- Streamlined 3-tab interface (Chat, Recommendation Details, Deployment Management)

This integrates Yuval's UI improvements with research-based SLO adjustments
based on Nielsen UX (1993), SCORPIO, vLLM, and GitHub Copilot studies.

Signed-off-by: Yuval Luria <yuval750@gmail.com>
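The commit above describes a PRIORITY_ADJUSTMENTS table and an apply_priority_adjustment() function that scale use-case SLO targets by the selected priority. A minimal sketch of that pattern, assuming illustrative multiplier values and metric names (the real table and its values live in the repo's JSON configs):

```python
# Hypothetical sketch of priority-based SLO adjustment. The multiplier
# values and metric keys are illustrative, not the project's actual data.
PRIORITY_ADJUSTMENTS = {
    "low_latency":     {"ttft_p95_ms": 0.7, "e2e_p95_ms": 0.7},
    "balanced":        {"ttft_p95_ms": 1.0, "e2e_p95_ms": 1.0},
    "cost_saving":     {"ttft_p95_ms": 1.5, "e2e_p95_ms": 1.5},
    "high_throughput": {"ttft_p95_ms": 1.2, "e2e_p95_ms": 1.2},
}

def apply_priority_adjustment(slo_targets: dict, priority: str) -> dict:
    """Scale each baseline SLO target by the multiplier for the chosen priority.

    Metrics with no multiplier, and unknown priorities, pass through unchanged.
    """
    factors = PRIORITY_ADJUSTMENTS.get(priority, {})
    return {k: v * factors.get(k, 1.0) for k, v in slo_targets.items()}
```

With this shape, "low_latency" tightens the latency targets (e.g. a 1000 ms TTFT target becomes 700 ms) while "cost_saving" relaxes them, which matches the intent described in the commit.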
- usecase_slo_workload.json: Research-based SLO targets per use case
- usecase_weights.json: Benchmark weights per use case
- all_usecases_config.json: Complete use case configurations

Required by Test_AA UI for PRIORITY_ADJUSTMENTS functionality.

Signed-off-by: Yuval Luria <yuval750@gmail.com>
- Add TASK DATASETS section showing use-case benchmark weights
- Add Technical Spec display for optional fields
- Add Real BLIS Benchmark SLOs section
- Add Research-Based SLO TARGETS vs BLIS Actual comparison
- Add 206 model catalog browser
- Add editable SLO Targets and Workload Profile
- Full integration with Andre's BLIS benchmark data

This is the complete POC UI with all features from Test_AA.

Signed-off-by: Yuval Luria <yuval750@gmail.com>
Only recommend H100 GPUs from real BLIS benchmark data.
A100-80 was showing as default but is not the primary benchmark hardware.

Signed-off-by: Yuval Luria <yuval750@gmail.com>
- Convert multiline f-string HTML templates to single-line strings
- This fixes raw HTML showing as text in Streamlit
- Affects render_slo_cards(), render_impact_factors(), render_impact_factors_mini()

Signed-off-by: Yuval Luria <yuval750@gmail.com>
MAJOR REFACTOR: New recommendation engine using real BLIS benchmarks

Changes:
- Add blis_recommendation() function with ACTUAL BLIS benchmark data
- Model+Hardware combinations scored together (41 unique combos)
- Restore A100-80 hardware (both H100 and A100-80 are real BLIS data)
- Quality scores from use-case CSVs (weighted_scores/)
- Latency/throughput from ACTUAL BLIS (ttft_p95, e2e_p95, tokens_per_second)
- Cost from hardware tier (cheaper = higher score)

Priority-based ranking:
- cost_saving: cheapest hardware meeting SLO for best models
- low_latency: fastest TTFT from BLIS data
- high_quality: best model quality with SLO-meeting hardware
- high_throughput: highest tokens/second from BLIS data
- balanced: weighted MCDM combination

Returns ranked list of Model+Hardware combinations with:
- Actual BLIS metrics (not interpolated)
- SLO compliance check
- Score breakdown (quality, latency, cost, throughput)

Signed-off-by: Yuval Luria <yuval750@gmail.com>
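For the "balanced" priority, the commit describes a weighted MCDM combination over Model+Hardware combos. A self-contained sketch of that ranking step, with hypothetical weights, sub-score values, and field names (real quality scores come from the AA CSVs and latency/throughput from BLIS):

```python
# Illustrative "balanced" ranking: weighted sum of normalized (0-1)
# sub-scores per model+hardware combo. Weights and data are hypothetical.
BALANCED_WEIGHTS = {"quality": 0.4, "latency": 0.25, "cost": 0.2, "throughput": 0.15}

def balanced_score(combo: dict) -> float:
    """Weighted composite of a combo's sub-scores."""
    return sum(w * combo[k] for k, w in BALANCED_WEIGHTS.items())

def rank_combos(combos: list[dict]) -> list[dict]:
    """Sort SLO-compliant combos best-first by balanced score."""
    eligible = [c for c in combos if c.get("meets_slo", True)]
    return sorted(eligible, key=balanced_score, reverse=True)

combos = [
    {"name": "Llama 3.1 8B on H100", "quality": 0.70, "latency": 0.90,
     "cost": 0.50, "throughput": 0.90, "meets_slo": True},
    {"name": "Llama 3.1 70B on H100 x4", "quality": 0.95, "latency": 0.60,
     "cost": 0.20, "throughput": 0.50, "meets_slo": True},
    {"name": "Llama 3.1 70B on A100-80", "quality": 0.95, "latency": 0.30,
     "cost": 0.40, "throughput": 0.30, "meets_slo": False},
]
```

The other priorities (cost_saving, low_latency, high_quality, high_throughput) would simply sort by a single sub-score or BLIS metric instead of the composite.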
Signed-off-by: Yuval Luria <yuval750@gmail.com>
- Fix blis_recommendation() to output 'recommendations' (not 'all_recommendations')
- Include hardware in model name (e.g., 'Llama 3.1 8B on H100 x2')
- Add blis_metrics to each recommendation (ttft_p95_ms, e2e_p95_ms, tokens_per_second)
- Add get_model_pros() and get_model_cons() based on ACTUAL BLIS data
- Score breakdown includes quality, latency, cost, capacity contributions

Now recommendations show:
- Model + Hardware combination
- ACTUAL BLIS latency metrics
- Hardware cost
- SLO compliance status

Signed-off-by: Yuval Luria <yuval750@gmail.com>
BUG: Code looked for data/benchmarks/benchmarks_BLIS.json
FIX: File is at data/benchmarks_BLIS.json

This was causing fallback to mock_recommendation() which shows
models without hardware configurations (gpt-oss-20B, Apriel, etc.)

Now BLIS data loads correctly and shows REAL Model+Hardware combos

Signed-off-by: Yuval Luria <yuval750@gmail.com>
- Fixed path: data/benchmarks_BLIS.json (was data/benchmarks/benchmarks_BLIS.json)
- Now recommendations show Model+Hardware combinations with ACTUAL BLIS metrics
- Quality from AA CSVs + Latency/Throughput from BLIS = Complete recommendation

Signed-off-by: Yuval Luria <yuval750@gmail.com>
Signed-off-by: Yuval Luria <yuval750@gmail.com>
Integrate Yuval's quality scoring with Andre's latency benchmarks

Signed-off-by: Andre Fredette <afredette@redhat.com>
- Update LLM from llama3.1:8b to qwen2.5:7b
- Expand backend structure with detailed module descriptions
- Document data directory structure (47 models, 9 use cases, BLIS benchmarks)
- Add use case configs directory documentation
- Update Critical Data Collections with BLIS as benchmark source
- Clarify PostgreSQL usage for serving benchmark data

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
…ia-anfredette

docs: Update CLAUDE.md to reflect current codebase state

Signed-off-by: Andre Fredette <afredette@redhat.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
Merge merge-yuvalluria branch into main

Signed-off-by: Andre Fredette <afredette@redhat.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
…-flow

doc: Add document describing the current recommendation flow

Signed-off-by: Andre Fredette <afredette@redhat.com>
Move the Technical Spec (Optional Fields) card from a separate column
into the same row as SLO TARGETS, WORKLOAD PROFILE, and TASK DATASETS.
All four cards now share consistent styling and layout.

Signed-off-by: Andre Fredette <afredette@redhat.com>
- Move token length fields from SLO Targets to Workload Profile column
- Rename fields: Mean Prompt Tokens, Mean Output Tokens
- Add hover tooltips explaining values are determined by use case
- Remove unused Concurrency field
- Remove BLIS Config message from SLO Targets (now shown in Workload Profile)
- Remove dead functions: render_impact_factors_mini, render_impact_factors

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
QPS exceeding single-GPU benchmarks is handled by recommending more GPUs,
not by warning the user. Removed unnecessary QPS-related warning messages.

Signed-off-by: Andre Fredette <afredette@redhat.com>
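The scaling rule this commit describes can be sketched as a simple ceiling division: when the requested QPS exceeds what one GPU sustains in the benchmark data, recommend more replicas instead of warning. Function name and signature are assumptions for illustration:

```python
import math

# Sketch of "handle high QPS by recommending more GPUs, not warning".
def replicas_for_qps(target_qps: float, single_gpu_qps: float) -> int:
    """Smallest replica count whose aggregate benchmarked QPS covers the target."""
    if single_gpu_qps <= 0:
        raise ValueError("single_gpu_qps must be positive")
    return max(1, math.ceil(target_qps / single_gpu_qps))
```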
refactor: Reorganize SLO cards layout and clean up workload UI
Replace UI-side optimal hardware recommendation with backend-driven
ranked recommendations that provide 5 sorted views:
- Balanced (weighted composite)
- Best Accuracy (model capability)
- Lowest Cost (price efficiency)
- Lowest Latency (SLO headroom)
- Simplest (deployment complexity)

Backend changes:
- Add /api/ranked-recommend-from-spec endpoint for spec-based ranking
- Add RankedRecommendationFromSpecRequest schema
- Add generate_ranked_recommendations_from_spec() workflow method

UI changes:
- Add fetch_ranked_recommendations() to call backend API
- Add render_weight_controls() with accuracy/cost/latency/simplicity weights
- Add render_ranked_recommendations() with unified table display
- Add expand/collapse toggle buttons for viewing alternatives
- Add "Re-Evaluate" button to apply configuration changes

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
…-display

feat: Add ranked recommendations with backend API integration
Restructure the contributing guide to have the complete Git workflow
in one place, from initial fork setup through submitting a PR. Previously
the steps were scattered across Getting Started, Development Workflow,
and Pull Request Guidelines sections.

Signed-off-by: Andre Fredette <afredette@redhat.com>
…uting

docs: Reorganize CONTRIBUTING.md with consolidated Git workflow
Signed-off-by: Amit Oren <amoren@redhat.com>
Signed-off-by: Amit Oren <amoren@redhat.com>
anfredette and others added 21 commits February 10, 2026 17:17
Delete UI code that duplicated backend-provided data:
- Remove get_raw_aa_accuracy(), load_weighted_scores(), and
  BENCHMARK_TO_AA_NAME_MAP — use backend scores.accuracy_score directly
- Remove calc_throughput() and USE_CASE_OUTPUT_TOKENS — use backend
  benchmark_metrics.tps_* directly
- Replace hardcoded priority_weights dict in winner dialog with actual
  user weights from session state
- Remove use_case_context and model_traits marketing copy dicts,
  replace with score-based summary

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
…eamlit

- Delete 1,259-line main CSS block, keep only deploy-button-hide rule
- Delete 6 inline CSS blocks (filter controls, SLO inputs, dialogs, etc.)
- Replace render_hero() with native st.image() + st.title() + st.caption()
- Replace render_score_bar() with st.progress() + native columns
- Replace carousel component (204 lines) with simple category cards
- Replace 11 section-header divs with st.subheader()
- Remove 3 JavaScript tab-switching blocks
- Strip all non-semantic hardcoded colors (~200 references), keeping only
  #ef4444 for errors/warnings and #10B981 for success indicators
- Remove hardcoded dark-mode backgrounds, gradients, and theme-specific borders
- Remove light-mode-only theme overrides from .streamlit/config.toml
- Fix invisible text from leftover -webkit-text-fill-color: transparent
- Fix 8 broken gradient remnants from partial CSS stripping
- Restore NeuralNav logo using st.image() in column layout

UI now relies on Streamlit's native theming for both light and dark mode.
File reduced from ~5,039 to ~3,218 lines.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
- Replace 27 individual session state init blocks with SESSION_DEFAULTS
  dict + loop
- Consolidate identical get_scores() / get_model_scores() into one
  module-level function
- Simplify dialog mutual-exclusion (Streamlit @st.dialog handles this)
- Extract render_slo_input() helper to replace 3 repeated TTFT/ITL/E2E
  input blocks
- Remove 3 dead switch_to_tab writes (JS consumers removed in Phase 4)
- Move per-metric percentile init into SESSION_DEFAULTS

File reduced from 3,218 to 3,117 lines.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
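The SESSION_DEFAULTS pattern from this commit replaces 27 per-key init blocks with one dict plus a loop. A sketch under assumed key names, using a plain dict in place of `st.session_state` so it runs without Streamlit (session_state supports the same `in` / item-assignment protocol):

```python
# Hypothetical defaults; the real keys live in ui/state.py.
SESSION_DEFAULTS = {
    "active_tab": "Chat",
    "priority": "balanced",
    "ttft_percentile": "p95",
    "selected_combo": None,
}

def init_session_state(state) -> None:
    """Set each default only if the key is not already present,
    so user-modified values survive Streamlit reruns."""
    for key, value in SESSION_DEFAULTS.items():
        if key not in state:
            state[key] = value
```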
- Split render_top5_table() into _render_filter_summary(),
  _render_category_card(), and slim orchestrator
- Split render_slo_cards() into _render_slo_targets(),
  _render_workload_profile(), _render_accuracy_benchmarks(),
  _render_priorities(), and slim orchestrator
- Move TASK_DATASETS to module level
- Add prev/next navigation to category cards (< 1 of 5 >)
  allowing users to browse and select from all top-5 per category
- Remove dead render_best_card(), unused best_* variables,
  duplicate fetch_workload_profile() call, and cascading
  unused parameters through 3 functions
- Deduplicate 3 priority rows into a loop in _render_priorities()

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
Break the 3,035-line app.py into 10 focused files with no circular
dependencies. Remove ~260 lines of dead code (render_weight_controls,
render_ranked_recommendations, DATA_DIR, build_row, nested
format_gpu_config).

New structure:
  ui/app.py (447 lines) — entry point + tab routing
  ui/helpers.py — pure utility functions
  ui/state.py — session state defaults + init
  ui/api_client.py — all backend API calls
  ui/components/dialogs.py — winner/category/full-table dialogs
  ui/components/slo.py — SLO targets, workload, priorities
  ui/components/extraction.py — extraction display/approval/edit
  ui/components/recommendations.py — category cards, top-5 table
  ui/components/deployment.py — deployment tab

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
- Simplify UI layout, compact cards, auto-tab navigation
- Reduce top whitespace, hide Deploy button via toolbarMode=viewer
- Align hero icon/title, rename buttons across workflow
- Replace HTML extraction grid with compact markdown display
- Redesign recommendation cards with Solution/Scores/Values lines
- Add circular navigation and clear selection on navigate
- Auto-switch tabs on Generate Specification / Generate Recommendations
- Remove RPS warnings and "Want Different Results?" section
- Add Priorities column headers, update label text in Technical Spec
- Clear deployment state when starting new analysis

Signed-off-by: Andre Fredette <afredette@redhat.com>
Auto-fixed import sorting (I001), unused imports (F401), extraneous
f-string prefix (F541), and Optional type annotation (UP007).
Manually fixed set comprehension (C401), combined with-statements
(SIM117), and unused variable (F841).

Signed-off-by: Andre Fredette <afredette@redhat.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
Simplify UI: remove ~3,000 lines of complexity, split into modules
Promote ARCHITECTUREv2.md to the canonical ARCHITECTURE.md and move
the original to archive/ARCHITECTUREv1.md. Update the internal
cross-reference link to match the new filename.

Signed-off-by: Andre Fredette <afredette@redhat.com>
docs: Rename ARCHITECTUREv2 to ARCHITECTURE and archive v1
Enable remote benchmark data management without shell access.
Adds three API endpoints (GET /db/status, POST /db/upload-benchmarks,
POST /db/reset) and a Configuration tab in the UI for uploading
benchmark JSON files, viewing database statistics, and resetting data.

Extracts shared loading logic into knowledge_base/loader.py, used by
the CLI script, API endpoints, and UI. Includes config_id deduplication with
requests_per_second for correct duplicate detection.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
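The deduplication rule mentioned above treats two benchmark records as duplicates only when both config_id and requests_per_second match, so the same configuration measured at different request rates is kept. A sketch assuming a flat record shape (field names follow the commit message; everything else is illustrative):

```python
# Sketch of config_id + requests_per_second duplicate detection.
def dedupe_benchmarks(records: list[dict]) -> list[dict]:
    """Keep the first record for each (config_id, requests_per_second) pair."""
    seen: set[tuple] = set()
    unique = []
    for rec in records:
        key = (rec["config_id"], rec["requests_per_second"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```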
After the PyPA src layout refactoring, BACKEND_DIR was never defined,
breaking make lint/format/test targets. Rename to SRC_DIR to match
the actual directory name and fix lint/format glob paths.

Signed-off-by: Andre Fredette <afredette@redhat.com>
feat: add database management via REST API and UI Configuration tab
…oling

Add complete Kubernetes manifests for deploying NeuralNav to OpenShift:

- Namespace, Secrets, and OpenShift Routes (TLS edge termination)
- PostgreSQL 16 with PVC-backed persistent storage
- Ollama with startup model pull (granite3.3:2b) and 20Gi PVC for models
- Backend API with environment-driven config (DATABASE_URL, OLLAMA_HOST,
  OLLAMA_MODEL) and health/readiness probes
- Streamlit UI with Streamlit-native health probes
- db-init Job that waits for PostgreSQL, initializes the schema, and
  loads all benchmark datasets (with bounded retry loop)
- deploy-all.sh orchestration script for ordered deployment via oc

Dockerfile updates for OpenShift compatibility:
- Pin platform to linux/amd64 for consistent cross-platform builds
- Run uvicorn directly via PATH instead of uv run
- Include scripts/ and README.md in image for db-init and hatchling
- Healthcheck validates HTTP status with raise_for_status()
- Restrict generated_configs/logs directory permissions to 750

OllamaClient refactored to use ollama.Client instance (required for
connecting to remote Ollama hosts in K8s) with OLLAMA_MODEL and
OLLAMA_HOST environment variable support.

Also adds Makefile docker-push target and increases the UI extract
timeout to 300s for CPU-only Ollama inference.

Co-authored-by: Andre Fredette <afredette@redhat.com>

Signed-off-by: Amit Oren <amoren@redhat.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
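The bounded retry loop in the db-init Job (wait for PostgreSQL, then load data, giving up after a fixed number of attempts rather than hanging the Job) can be sketched as below. The real Job likely does this in shell inside the manifest; attempt count, delay, and function name here are assumptions:

```python
import time

# Sketch of a bounded readiness poll, as used by the db-init Job.
def wait_until_ready(check, attempts: int = 30, delay_s: float = 2.0) -> bool:
    """Poll `check()` up to `attempts` times; return False instead of blocking forever."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay_s)
    return False
```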
Add dedicated targets to build and push the backend Docker image to
quay.io, following the same pattern as the existing simulator targets.

Signed-off-by: Amit Oren <amoren@redhat.com>
…ackend-targets

feat: add build-backend and push-backend Makefile targets
Structured technical evaluation covering functional comparison,
architectural analysis, overlap/complementarity assessment, and
integration feasibility between llm-d-benchmark's config_explorer
and NeuralNav. Proposes offline synthetic benchmark generation as
the recommended integration path.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
Makes the case for merging both projects into a single upstream
tool in the llm-d organization. Covers strategic rationale,
unified architecture, phased integration roadmap, and risk
analysis with mitigations.

Signed-off-by: Andre Fredette <afredette@redhat.com>

coderabbitai bot commented Mar 3, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ef4b740d-bb05-43c7-b497-d4464e9d35fc

📝 Walkthrough

Three new documentation files propose and evaluate the integration of config_explorer and NeuralNav projects. The documents include architectural comparisons, integration strategies, feasibility assessments, and governance models for combining the two tools into a unified upstream project or ecosystem collaboration.

Changes

Cohort / File(s): Integration Proposal & Evaluation Documentation
(docs/CE_NN_INTEGRATION_PROPOSAL.md, docs/CONFIG_EXPLORER_EVALUATION.md, docs/SIG_BENCHMARKING_ISSUE.md)
Summary: Three new proposal documents detailing the integration of config_explorer and NeuralNav. Includes technical evaluation comparing functional capabilities, data models, and APIs; integration rationale covering guided workflows, evaluation expansion, and ranking; feasibility assessment with multiple integration models (offline generation, loose coupling, full merge); risk analysis; and governance proposals for SIG Benchmarking collaboration. No code or API changes introduced.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 3 passed

  • Title check: ✅ Passed. The title accurately and concisely describes the main changes: adding documentation files that present a collaborative analysis between Config Explorer and NeuralNav, plus a proposal to the SIG Benchmarking community.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; check skipped.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.



coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/CE_NN_INTEGRATION_PROPOSAL.md (1)

36-53: Make comparative capability claims auditable.

The capability matrix uses definitive “None/Shared” overlap labels, but doesn’t point to evidence boundaries (code refs, benchmark artifacts, or date/commit scope). For a governance-facing doc, this makes technical conclusions hard to validate and maintain over time.

A compact “Evidence & scope” note under the table (sources + snapshot date) would make the claims reviewable.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."


ℹ️ Review info

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between ad7c395 and 1bfb3bb.

📒 Files selected for processing (3)
  • docs/CE_NN_INTEGRATION_PROPOSAL.md
  • docs/CONFIG_EXPLORER_EVALUATION.md
  • docs/SIG_BENCHMARKING_ISSUE.md

@@ -0,0 +1,103 @@
# Proposal: Unifying config_explorer and NeuralNav into a Single Upstream Project

⚠️ Potential issue | 🟠 Major

Align this document’s strategy with the SIG proposal narrative.

This file positions a single unified upstream project, while docs/SIG_BENCHMARKING_ISSUE.md explicitly proposes collaboration with both tools remaining independent. That strategic mismatch will confuse stakeholders during SIG review and weakens the proposal package coherence.

Consider adding a short “Positioning” section near Line 1 that clearly states whether this is:

  1. the preferred direction, or
  2. an alternative to the independent-collaboration path in the SIG issue.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 15-16, 61-67


Comment on lines +331 to +334
1. **Accuracy**: For (model, GPU, traffic_profile) combos where NeuralNav has real GuideLLM benchmarks, the roofline estimates agree within 30% on TTFT, ITL, and E2E latency for the majority (>60%) of tested combos.
2. **Coverage expansion**: The synthetic generator produces valid estimates for at least 3 model+GPU combos that NeuralNav currently cannot recommend.
3. **End-to-end flow**: Synthetic benchmarks loaded into NeuralNav are scored, ranked, and displayed with an "Estimated" indicator -- no existing recommendation quality degraded.


⚠️ Potential issue | 🟠 Major

POC acceptance thresholds are too loose for ranking-quality decisions.

Proceeding when latency estimates are within 30% for just >60% of combos (and still allowing 30–50% error with caveats) risks introducing noisy synthetic data into recommendation ranking. That can degrade trust in “best” outputs.

Please tighten go/no-go criteria (e.g., stricter error bands for p95-sensitive metrics and minimum pass rates per traffic profile), not only aggregate pass rates.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 373-375


anfredette marked this pull request as draft March 3, 2026 23:45
Draft issue proposing contributing NeuralNav to the llm-d SIG
Benchmarking ecosystem and enabling collaboration with Config Explorer.

Signed-off-by: Andre Fredette <afredette@redhat.com>
anfredette force-pushed the config-exlorer-eval branch from 1bfb3bb to d8f55f3 (March 4, 2026 23:23)
anfredette closed this Mar 26, 2026
anfredette deleted the config-exlorer-eval branch March 27, 2026 17:31
4 participants