docs: add Config Explorer / NeuralNav collaboration analysis and SIG proposal #104

anfredette wants to merge 217 commits into llm-d-incubation:main
Conversation
- Add usecase_quality_scorer.py: loads use-case-specific scores from Artificial Analysis
- Update model_evaluator.py: uses quality scorer (50 pts), keeps Andre's latency/budget scoring
- Update solution_scorer.py: score_accuracy() uses quality data, falls back to size
- Add weighted_scores CSVs for 9 use cases
- Add USE_CASE_METHODOLOGY.md, research data, BLIS benchmarks

Integration: Quality (Yuval) + Latency/Throughput (Andre)

Signed-off-by: Yuval Luria <yuval750@gmail.com>
- Add PRIORITY_ADJUSTMENTS for SLO tuning (low_latency, balanced, cost_saving, high_throughput)
- Add USECASE_SLO_WORKLOAD configuration loading from JSON
- Add apply_priority_adjustment() function for dynamic SLO adjustment
- Streamline 3-tab interface (Chat, Recommendation Details, Deployment Management)

This integrates Yuval's UI improvements with research-based SLO adjustments drawing on Nielsen UX (1993), SCORPIO, vLLM, and GitHub Copilot studies.

Signed-off-by: Yuval Luria <yuval750@gmail.com>
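The priority-based SLO adjustment described above could be sketched roughly as follows. This is an illustration only: the multiplier values and the exact shape of PRIORITY_ADJUSTMENTS are assumptions, not the repo's actual JSON configuration.

```python
# Hypothetical sketch of apply_priority_adjustment(); the real adjustment
# values are loaded from the repo's JSON config, not hardcoded like this.
PRIORITY_ADJUSTMENTS = {
    "low_latency":     {"ttft_ms": 0.7, "e2e_ms": 0.7},  # tighten latency SLOs
    "balanced":        {"ttft_ms": 1.0, "e2e_ms": 1.0},  # keep research defaults
    "cost_saving":     {"ttft_ms": 1.5, "e2e_ms": 1.5},  # relax SLOs for cheaper HW
    "high_throughput": {"ttft_ms": 1.3, "e2e_ms": 1.2},  # trade latency for tokens/s
}

def apply_priority_adjustment(slo_targets: dict, priority: str) -> dict:
    """Scale baseline SLO targets by the multipliers for the chosen priority."""
    factors = PRIORITY_ADJUSTMENTS.get(priority, PRIORITY_ADJUSTMENTS["balanced"])
    # Any target without a listed factor passes through unchanged.
    return {k: v * factors.get(k, 1.0) for k, v in slo_targets.items()}
```

With these illustrative multipliers, a 500 ms TTFT target under `low_latency` becomes 350 ms, while an unknown priority falls back to the balanced (identity) factors.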
- usecase_slo_workload.json: research-based SLO targets per use case
- usecase_weights.json: benchmark weights per use case
- all_usecases_config.json: complete use case configurations

Required by the Test_AA UI for PRIORITY_ADJUSTMENTS functionality.

Signed-off-by: Yuval Luria <yuval750@gmail.com>
- Add TASK DATASETS section showing use-case benchmark weights
- Add Technical Spec display for optional fields
- Add Real BLIS Benchmark SLOs section
- Add Research-Based SLO TARGETS vs. BLIS Actual comparison
- Add 206-model catalog browser
- Add editable SLO Targets and Workload Profile
- Full integration with Andre's BLIS benchmark data

This is the complete POC UI with all features from Test_AA.

Signed-off-by: Yuval Luria <yuval750@gmail.com>
Only recommend H100 GPUs from real BLIS benchmark data. A100-80 was showing as default but is not the primary benchmark hardware. Signed-off-by: Yuval Luria <yuval750@gmail.com>
- Convert multiline f-string HTML templates to single-line strings
- Fixes raw HTML showing as text in Streamlit
- Affects render_slo_cards(), render_impact_factors(), render_impact_factors_mini()

Signed-off-by: Yuval Luria <yuval750@gmail.com>
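The likely mechanism behind this bug is Markdown's indented-code-block rule: indentation carried inside a triple-quoted f-string can make `st.markdown` render the HTML as literal text. A minimal sketch of the single-line approach (the function name is hypothetical, not one of the functions listed above):

```python
# Sketch of the fix: build the HTML as one line so no leading indentation
# from the Python source leaks into the string passed to st.markdown().
def slo_card_html(label: str, value: str) -> str:
    return f'<div class="slo-card"><b>{label}</b>: {value}</div>'

html = slo_card_html("TTFT p95", "350 ms")
# The caller would then render it with:
#   st.markdown(html, unsafe_allow_html=True)
```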
MAJOR REFACTOR: new recommendation engine using real BLIS benchmarks

Changes:
- Add blis_recommendation() function built on actual BLIS benchmark data
- Score Model+Hardware combinations together (41 unique combos)
- Restore A100-80 hardware (both H100 and A100-80 are real BLIS data)
- Quality scores from use-case CSVs (weighted_scores/)
- Latency/throughput from actual BLIS metrics (ttft_p95, e2e_p95, tokens_per_second)
- Cost from hardware tier (cheaper = higher score)

Priority-based ranking:
- cost_saving: cheapest hardware meeting SLO for the best models
- low_latency: fastest TTFT from BLIS data
- high_quality: best model quality with SLO-meeting hardware
- high_throughput: highest tokens/second from BLIS data
- balanced: weighted MCDM combination

Returns a ranked list of Model+Hardware combinations with:
- Actual BLIS metrics (not interpolated)
- SLO compliance check
- Score breakdown (quality, latency, cost, throughput)

Signed-off-by: Yuval Luria <yuval750@gmail.com>
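The "balanced" weighted MCDM combination above amounts to a weighted sum over normalized sub-scores. A sketch under stated assumptions -- the weights, score values, and combo names below are illustrative, not the engine's actual data:

```python
# Hypothetical sketch of the balanced MCDM ranking over Model+Hardware combos.
# Each sub-score is assumed pre-normalized to the 0-1 range.
def balanced_score(combo: dict, weights: dict) -> float:
    return sum(weights[k] * combo[k] for k in weights)

combos = [
    {"name": "Llama 3.1 8B on H100 x1",
     "quality": 0.62, "latency": 0.90, "cost": 0.80, "throughput": 0.75},
    {"name": "Llama 3.1 70B on H100 x4",
     "quality": 0.95, "latency": 0.55, "cost": 0.30, "throughput": 0.60},
]
weights = {"quality": 0.4, "latency": 0.2, "cost": 0.2, "throughput": 0.2}

# Rank combos by composite score, best first.
ranked = sorted(combos, key=lambda c: balanced_score(c, weights), reverse=True)
```

With these illustrative numbers the smaller model wins the balanced view (0.738 vs. 0.67) because its latency, cost, and throughput advantages outweigh the larger model's quality edge.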
- Fix blis_recommendation() to output 'recommendations' (not 'all_recommendations')
- Include hardware in the model name (e.g., 'Llama 3.1 8B on H100 x2')
- Add blis_metrics to each recommendation (ttft_p95_ms, e2e_p95_ms, tokens_per_second)
- Add get_model_pros() and get_model_cons() based on actual BLIS data
- Include quality, latency, cost, and capacity contributions in the score breakdown

Recommendations now show:
- Model + Hardware combination
- Actual BLIS latency metrics
- Hardware cost
- SLO compliance status

Signed-off-by: Yuval Luria <yuval750@gmail.com>
BUG: code looked for data/benchmarks/benchmarks_BLIS.json
FIX: the file is at data/benchmarks_BLIS.json

The wrong path caused a fallback to mock_recommendation(), which shows models without hardware configurations (gpt-oss-20B, Apriel, etc.). BLIS data now loads correctly and real Model+Hardware combos are shown.

Signed-off-by: Yuval Luria <yuval750@gmail.com>
- Fix path: data/benchmarks_BLIS.json (was data/benchmarks/benchmarks_BLIS.json)
- Recommendations now show Model+Hardware combinations with actual BLIS metrics
- Quality from AA CSVs + latency/throughput from BLIS = complete recommendation

Signed-off-by: Yuval Luria <yuval750@gmail.com>
Integrate Yuval's quality scoring with Andre's latency benchmarks Signed-off-by: Andre Fredette <afredette@redhat.com>
- Update LLM from llama3.1:8b to qwen2.5:7b
- Expand backend structure with detailed module descriptions
- Document data directory structure (47 models, 9 use cases, BLIS benchmarks)
- Add use case configs directory documentation
- Update Critical Data Collections with BLIS as the benchmark source
- Clarify PostgreSQL usage for serving benchmark data

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
…ia-anfredette docs: Update CLAUDE.md to reflect current codebase state Signed-off-by: Andre Fredette <afredette@redhat.com>
Merge merge-yuvalluria branch into main Signed-off-by: Andre Fredette <afredette@redhat.com>
…-flow doc: Add document describing the current recommendation flow Signed-off-by: Andre Fredette <afredette@redhat.com>
Move the Technical Spec (Optional Fields) card from a separate column into the same row as SLO TARGETS, WORKLOAD PROFILE, and TASK DATASETS. All four cards now share consistent styling and layout. Signed-off-by: Andre Fredette <afredette@redhat.com>
- Move token length fields from SLO Targets to the Workload Profile column
- Rename fields: Mean Prompt Tokens, Mean Output Tokens
- Add hover tooltips explaining that values are determined by the use case
- Remove unused Concurrency field
- Remove BLIS Config message from SLO Targets (now shown in Workload Profile)
- Remove dead functions: render_impact_factors_mini, render_impact_factors

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
QPS exceeding single-GPU benchmarks is handled by recommending more GPUs, not by warning the user. Removed unnecessary QPS-related warning messages. Signed-off-by: Andre Fredette <afredette@redhat.com>
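The "recommend more GPUs instead of warning" behavior amounts to a capacity calculation roughly like the following. The function name and the near-linear-scaling assumption (throughput scales with replica count) are mine, not the actual implementation:

```python
import math

def gpus_needed(target_qps: float, benchmarked_qps_per_gpu: float) -> int:
    """Round the replica count up so benchmarked capacity covers the target QPS.

    Assumes near-linear scaling across replicas, which is an approximation.
    """
    return max(1, math.ceil(target_qps / benchmarked_qps_per_gpu))
```

For example, a 25 QPS target against an 8 QPS single-GPU benchmark would yield a 4-GPU recommendation rather than a warning.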
refactor: Reorganize SLO cards layout and clean up workload UI
Replace the UI-side optimal hardware recommendation with backend-driven ranked recommendations that provide 5 sorted views:
- Balanced (weighted composite)
- Best Accuracy (model capability)
- Lowest Cost (price efficiency)
- Lowest Latency (SLO headroom)
- Simplest (deployment complexity)

Backend changes:
- Add /api/ranked-recommend-from-spec endpoint for spec-based ranking
- Add RankedRecommendationFromSpecRequest schema
- Add generate_ranked_recommendations_from_spec() workflow method

UI changes:
- Add fetch_ranked_recommendations() to call the backend API
- Add render_weight_controls() with accuracy/cost/latency/simplicity weights
- Add render_ranked_recommendations() with unified table display
- Add expand/collapse toggle buttons for viewing alternatives
- Add "Re-Evaluate" button to apply configuration changes

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
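A sketch of how the user-adjustable accuracy/cost/latency/simplicity weights might feed the Balanced composite. The normalization step and all names here are assumptions for illustration, not the actual workflow code:

```python
def normalize_weights(raw: dict) -> dict:
    """Scale user-entered weights to sum to 1; fall back to uniform if all zero."""
    total = sum(raw.values())
    if total == 0:
        return {k: 1 / len(raw) for k in raw}
    return {k: v / total for k, v in raw.items()}

def composite(scores: dict, weights: dict) -> float:
    """Weighted sum used for the Balanced view (sub-scores assumed in 0-1)."""
    return sum(weights[k] * scores[k] for k in weights)

# Example: a user doubles the accuracy slider and zeroes out simplicity.
w = normalize_weights({"accuracy": 2, "cost": 1, "latency": 1, "simplicity": 0})
```

Normalizing keeps the composite comparable across re-evaluations no matter what raw slider values the user enters.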
…-display feat: Add ranked recommendations with backend API integration
Restructure the contributing guide to have the complete Git workflow in one place, from initial fork setup through submitting a PR. Previously the steps were scattered across Getting Started, Development Workflow, and Pull Request Guidelines sections. Signed-off-by: Andre Fredette <afredette@redhat.com>
…uting docs: Reorganize CONTRIBUTING.md with consolidated Git workflow
Signed-off-by: Amit Oren <amoren@redhat.com>
Delete UI code that duplicated backend-provided data:
- Remove get_raw_aa_accuracy(), load_weighted_scores(), and BENCHMARK_TO_AA_NAME_MAP -- use backend scores.accuracy_score directly
- Remove calc_throughput() and USE_CASE_OUTPUT_TOKENS -- use backend benchmark_metrics.tps_* directly
- Replace the hardcoded priority_weights dict in the winner dialog with the actual user weights from session state
- Remove use_case_context and model_traits marketing copy dicts; replace with a score-based summary

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
…eamlit

- Delete the 1,259-line main CSS block, keeping only the deploy-button-hide rule
- Delete 6 inline CSS blocks (filter controls, SLO inputs, dialogs, etc.)
- Replace render_hero() with native st.image() + st.title() + st.caption()
- Replace render_score_bar() with st.progress() + native columns
- Replace the carousel component (204 lines) with simple category cards
- Replace 11 section-header divs with st.subheader()
- Remove 3 JavaScript tab-switching blocks
- Strip all non-semantic hardcoded colors (~200 references), keeping only #ef4444 for errors/warnings and #10B981 for success indicators
- Remove hardcoded dark-mode backgrounds, gradients, and theme-specific borders
- Remove light-mode-only theme overrides from .streamlit/config.toml
- Fix invisible text from leftover -webkit-text-fill-color: transparent
- Fix 8 broken gradient remnants from partial CSS stripping
- Restore the NeuralNav logo using st.image() in a column layout

The UI now relies on Streamlit's native theming for both light and dark mode. File reduced from ~5,039 to ~3,218 lines.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
- Replace 27 individual session state init blocks with a SESSION_DEFAULTS dict + loop
- Consolidate identical get_scores() / get_model_scores() into one module-level function
- Simplify dialog mutual exclusion (Streamlit @st.dialog handles this)
- Extract render_slo_input() helper to replace 3 repeated TTFT/ITL/E2E input blocks
- Remove 3 dead switch_to_tab writes (JS consumers removed in Phase 4)
- Move per-metric percentile init into SESSION_DEFAULTS

File reduced from 3,218 to 3,117 lines.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
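The SESSION_DEFAULTS dict + loop pattern can be sketched as below. The keys are illustrative, and a plain dict stands in for `st.session_state`, which exposes a dict-like interface with `setdefault`:

```python
# Illustrative defaults -- not the app's actual session keys.
SESSION_DEFAULTS = {
    "selected_use_case": None,
    "priority": "balanced",
    "ttft_percentile": "p95",
    "show_alternatives": False,
}

def init_session_state(session_state: dict) -> None:
    """Set each default only if the key is absent, so reruns keep user edits."""
    for key, value in SESSION_DEFAULTS.items():
        session_state.setdefault(key, value)

# Simulate a Streamlit rerun where the user already changed one value.
state = {"priority": "low_latency"}
init_session_state(state)
```

Using `setdefault` in a loop replaces dozens of `if key not in st.session_state:` blocks while preserving any values the user has already set.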
- Split render_top5_table() into _render_filter_summary(), _render_category_card(), and a slim orchestrator
- Split render_slo_cards() into _render_slo_targets(), _render_workload_profile(), _render_accuracy_benchmarks(), _render_priorities(), and a slim orchestrator
- Move TASK_DATASETS to module level
- Add prev/next navigation to category cards (< #1 of 5 >), letting users browse and select from all top-5 per category
- Remove dead render_best_card(), unused best_* variables, a duplicate fetch_workload_profile() call, and cascading unused parameters through 3 functions
- Deduplicate 3 priority rows into a loop in _render_priorities()

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
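The prev/next navigation over a top-5 list reduces to modular arithmetic once it wraps at both ends (the later circular-navigation commit suggests it does). A sketch with assumed names:

```python
def navigate(index: int, delta: int, total: int = 5) -> int:
    """Circular prev/next over a fixed-size list: wraps at both ends.

    Python's % always returns a non-negative result for positive total,
    so stepping back from index 0 lands on the last item.
    """
    return (index + delta) % total
```

For example, "next" from card 5 of 5 returns to card 1, and "prev" from card 1 jumps to card 5.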
Break the 3,035-line app.py into 10 focused files with no circular dependencies. Remove ~260 lines of dead code (render_weight_controls, render_ranked_recommendations, DATA_DIR, build_row, nested format_gpu_config).

New structure:
- ui/app.py (447 lines) -- entry point + tab routing
- ui/helpers.py -- pure utility functions
- ui/state.py -- session state defaults + init
- ui/api_client.py -- all backend API calls
- ui/components/dialogs.py -- winner/category/full-table dialogs
- ui/components/slo.py -- SLO targets, workload, priorities
- ui/components/extraction.py -- extraction display/approval/edit
- ui/components/recommendations.py -- category cards, top-5 table
- ui/components/deployment.py -- deployment tab

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
- Simplify UI layout, compact cards, auto-tab navigation
- Reduce top whitespace; hide the Deploy button via toolbarMode=viewer
- Align hero icon/title; rename buttons across the workflow
- Replace the HTML extraction grid with a compact markdown display
- Redesign recommendation cards with Solution/Scores/Values lines
- Add circular navigation and clear selection on navigate
- Auto-switch tabs on Generate Specification / Generate Recommendations
- Remove RPS warnings and the "Want Different Results?" section
- Add Priorities column headers; update label text in Technical Spec
- Clear deployment state when starting a new analysis

Signed-off-by: Andre Fredette <afredette@redhat.com>
Auto-fixed import sorting (I001), unused imports (F401), extraneous f-string prefix (F541), and Optional type annotation (UP007). Manually fixed set comprehension (C401), combined with-statements (SIM117), and unused variable (F841). Signed-off-by: Andre Fredette <afredette@redhat.com>
Simplify UI: remove ~3,000 lines of complexity, split into modules
Promote ARCHITECTUREv2.md to the canonical ARCHITECTURE.md and move the original to archive/ARCHITECTUREv1.md. Update the internal cross-reference link to match the new filename. Signed-off-by: Andre Fredette <afredette@redhat.com>
docs: Rename ARCHITECTUREv2 to ARCHITECTURE and archive v1
Enable remote benchmark data management without shell access. Adds three API endpoints (GET /db/status, POST /db/upload-benchmarks, POST /db/reset) and a Configuration tab in the UI for uploading benchmark JSON files, viewing database statistics, and resetting data.

Extracts shared loading logic into knowledge_base/loader.py, used by the CLI script, API endpoints, and UI. Includes config_id deduplication with requests_per_second for correct duplicate detection.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
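The "config_id deduplication with requests_per_second" described above can be sketched like this. The field names follow the commit message, but the helper functions are an illustration, not the actual loader code:

```python
def dedup_key(record: dict) -> tuple:
    # config_id alone would collide across load levels, so include
    # requests_per_second in the identity of a benchmark row.
    return (record["config_id"], record["requests_per_second"])

def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first record for each (config_id, requests_per_second) pair."""
    seen: set[tuple] = set()
    out = []
    for rec in records:
        key = dedup_key(rec)
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out
```

This way, re-uploading the same JSON file skips rows already loaded, while the same configuration benchmarked at a different request rate is still treated as new data.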
After the PyPA src layout refactoring, BACKEND_DIR was never defined, breaking make lint/format/test targets. Rename to SRC_DIR to match the actual directory name and fix lint/format glob paths. Signed-off-by: Andre Fredette <afredette@redhat.com>
feat: add database management via REST API and UI Configuration tab
…oling

Add complete Kubernetes manifests for deploying NeuralNav to OpenShift:
- Namespace, Secrets, and OpenShift Routes (TLS edge termination)
- PostgreSQL 16 with PVC-backed persistent storage
- Ollama with startup model pull (granite3.3:2b) and a 20Gi PVC for models
- Backend API with environment-driven config (DATABASE_URL, OLLAMA_HOST, OLLAMA_MODEL) and health/readiness probes
- Streamlit UI with Streamlit-native health probes
- db-init Job that waits for PostgreSQL, initializes the schema, and loads all benchmark datasets (with a bounded retry loop)
- deploy-all.sh orchestration script for ordered deployment via oc

Dockerfile updates for OpenShift compatibility:
- Pin platform to linux/amd64 for consistent cross-platform builds
- Run uvicorn directly via PATH instead of uv run
- Include scripts/ and README.md in the image for db-init and hatchling
- Healthcheck validates HTTP status with raise_for_status()
- Restrict generated_configs/logs directory permissions to 750

OllamaClient refactored to use an ollama.Client instance (required for connecting to remote Ollama hosts in K8s) with OLLAMA_MODEL and OLLAMA_HOST environment variable support.

Also adds a Makefile docker-push target and increases the UI extract timeout to 300s for CPU-only Ollama inference.

Co-authored-by: Andre Fredette <afredette@redhat.com>
Signed-off-by: Amit Oren <amoren@redhat.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
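The environment-driven Ollama configuration could be resolved roughly as follows. The helper name and the default values are assumptions based on the commit message; the real wiring lives inside OllamaClient:

```python
import os

def ollama_settings(env: dict) -> tuple:
    """Resolve Ollama host and model from the environment with local-dev fallbacks."""
    host = env.get("OLLAMA_HOST", "http://localhost:11434")
    model = env.get("OLLAMA_MODEL", "granite3.3:2b")
    return host, model

# In the backend this would back an ollama.Client instance, e.g.
#   client = ollama.Client(host=host); client.chat(model=model, messages=[...])
host, model = ollama_settings(dict(os.environ))
```

In the K8s manifests, OLLAMA_HOST would point at the in-cluster Ollama Service rather than localhost, which is why the default-Client-to-localhost behavior had to be replaced with an explicit host.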
Feature: k8s deployment
docs: Add agentic design document
Add dedicated targets to build and push the backend Docker image to quay.io, following the same pattern as the existing simulator targets. Signed-off-by: Amit Oren <amoren@redhat.com>
…ackend-targets feat: add build-backend and push-backend Makefile targets
Structured technical evaluation covering functional comparison, architectural analysis, overlap/complementarity assessment, and integration feasibility between llm-d-benchmark's config_explorer and NeuralNav. Proposes offline synthetic benchmark generation as the recommended integration path. Assisted-by: Claude <noreply@anthropic.com> Signed-off-by: Andre Fredette <afredette@redhat.com>
Makes the case for merging both projects into a single upstream tool in the llm-d organization. Covers strategic rationale, unified architecture, phased integration roadmap, and risk analysis with mitigations. Signed-off-by: Andre Fredette <afredette@redhat.com>
Review skipped: draft detected.
📝 Walkthrough

Three new documentation files propose and evaluate the integration of the config_explorer and NeuralNav projects. The documents include architectural comparisons, integration strategies, feasibility assessments, and governance models for combining the two tools into a unified upstream project or an ecosystem collaboration.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes. Pre-merge checks: ✅ 3 passed.
Actionable comments posted: 2
🧹 Nitpick comments (1)
docs/CE_NN_INTEGRATION_PROPOSAL.md (1)
36-53: Make comparative capability claims auditable.The capability matrix uses definitive “None/Shared” overlap labels, but doesn’t point to evidence boundaries (code refs, benchmark artifacts, or date/commit scope). For a governance-facing doc, this makes technical conclusions hard to validate and maintain over time.
A compact “Evidence & scope” note under the table (sources + snapshot date) would make the claims reviewable.
As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/CE_NN_INTEGRATION_PROPOSAL.md` around lines 36 - 53, Add an "Evidence & scope" note immediately under the capability matrix that lists the data sources and snapshot scope used to label overlaps (e.g., code references, benchmark artifact names, and commit hashes or dates) and a short statement of the review cutoff date; explicitly map ambiguous labels ("None"/"Shared") to the evidence (for example: "None — verified by absence of features in config_explorer repo at commit abc123 and no benchmarking artifacts; Shared — both projects reference GPU cost lookup data source X dated 2025-01-15"). Ensure the note is compact (1–3 lines) and includes at least one pointer to code/benchmark artifacts and the snapshot date so the claims become auditable and maintainable.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (3):
- docs/CE_NN_INTEGRATION_PROPOSAL.md
- docs/CONFIG_EXPLORER_EVALUATION.md
- docs/SIG_BENCHMARKING_ISSUE.md
@@ -0,0 +1,103 @@
# Proposal: Unifying config_explorer and NeuralNav into a Single Upstream Project
Align this document’s strategy with the SIG proposal narrative.
This file positions a single unified upstream project, while docs/SIG_BENCHMARKING_ISSUE.md explicitly proposes collaboration with both tools remaining independent. That strategic mismatch will confuse stakeholders during SIG review and weakens the proposal package coherence.
Consider adding a short “Positioning” section near Line 1 that clearly states whether this is:
- the preferred direction, or
- an alternative to the independent-collaboration path in the SIG issue.
As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 15-16, 61-67
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/CE_NN_INTEGRATION_PROPOSAL.md` at line 1, The proposal currently
presents a unified upstream project but conflicts with the alternative
independent-collaboration approach in docs/SIG_BENCHMARKING_ISSUE.md; add a
short "Positioning" section at the top of CE_NN_INTEGRATION_PROPOSAL.md (near
the title) that explicitly states whether the document advocates (1) the
preferred direction of a single unified upstream project or (2) an alternative
option to pursue independent collaboration between config_explorer and
NeuralNav, and briefly summarize the rationale and implications for SIG review;
also update the corresponding paragraphs around lines referenced (15-16 and
61-67) to reflect this positioning so the narrative is consistent with
SIG_BENCHMARKING_ISSUE.md.
1. **Accuracy**: For (model, GPU, traffic_profile) combos where NeuralNav has real GuideLLM benchmarks, the roofline estimates agree within 30% on TTFT, ITL, and E2E latency for the majority (>60%) of tested combos.
2. **Coverage expansion**: The synthetic generator produces valid estimates for at least 3 model+GPU combos that NeuralNav currently cannot recommend.
3. **End-to-end flow**: Synthetic benchmarks loaded into NeuralNav are scored, ranked, and displayed with an "Estimated" indicator -- no existing recommendation quality degraded.
POC acceptance thresholds are too loose for ranking-quality decisions.
Proceeding when latency estimates are within 30% for just >60% of combos (and still allowing 30–50% error with caveats) risks introducing noisy synthetic data into recommendation ranking. That can degrade trust in “best” outputs.
Please tighten go/no-go criteria (e.g., stricter error bands for p95-sensitive metrics and minimum pass rates per traffic profile), not only aggregate pass rates.
As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 373-375
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/CONFIG_EXPLORER_EVALUATION.md` around lines 331 - 334, Tighten the POC
acceptance thresholds in the "Accuracy" criterion: change the aggregate
pass-rate requirement from ">60% of tested combos within 30%" to a stricter rule
such as "≥80% of tested (model, GPU, traffic_profile) combos within ±20%" and
add per-traffic_profile minimums (e.g., each traffic_profile must meet ≥75% pass
rate). For p95-sensitive metrics (TTFT_p95, ITL_p95, E2E_p95) tighten the
allowable error band to ±15% and require these metrics individually to pass the
threshold (not just aggregate). Remove or narrow the caveat allowing 30–50%
error—replace with a hard fail threshold (e.g., any metric >25% error flagged as
fail) and require documented acceptance justification for any exceptions. Ensure
these changes apply alongside the "Coverage expansion" and "End-to-end flow"
checks so synthetic benchmarks only proceed when both aggregate and
per-profile/p95 conditions are met.
Draft issue proposing contributing NeuralNav to the llm-d SIG Benchmarking ecosystem and enabling collaboration with Config Explorer. Signed-off-by: Andre Fredette <afredette@redhat.com>
Force-pushed from 1bfb3bb to d8f55f3.
Docs Include
Context
These documents support an upcoming proposal to the llm-d SIG Benchmarking
community to contribute NeuralNav and explore integration with Config Explorer.
NOTES: