chore: add conventional commits pre-commit hook #84
Migrate the evalhub_adapter package from the original behavioral-tests repo into tests/behavioral/evalhub_adapter/ as a first-class source package with unit tests and benchmark fixture files.

- Adapter: AgenticEvalAdapter implementing FrameworkAdapter interface with 6-phase eval pipeline (init, load, eval, post-process, persist, report), 10 scorer dispatches, MLflow trace enrichment
- Config: AgenticEvalParams bridge from JobSpec to TaskConfig
- Benchmarks: 7 benchmark definitions with golden query YAML loader
- Fixtures: tool_use.yaml with 5 queries, 6 stub YAMLs for remaining benchmarks
- Tests: 32 unit tests covering config, benchmarks, scorer resolution, YAML loading, real fixture parsing, and edge cases
- pyproject.toml: evalhub optional dep, unit marker, package include

Key design decisions:
- Lazy __init__.py import so unit tests run without evalhub SDK
- verify_ssl defaults to True with warning log when disabled
- query_error metric excluded from overall score computation
- _report_fatal accepts phase parameter for accurate EvalHub reporting
- MLflow gating requires both tracking_uri and experiment_name

RHAIENG-4604
Made-with: Cursor
Strip speculative code from the EvalHub adapter migration:
- Remove 6 empty benchmark YAML stubs (keep only agentic-tool-use)
- Trim BENCHMARKS registry to the one working benchmark
- Remove unused QuerySpec.difficulty/category fields
- Remove unused TaskConfig import from adapter.py
- Simplify __init__.py to plain import
- Fix main() docstring to reference FrameworkAdapter base class

Add missing deliverables:
- README.md documenting adapter design, what works, what's planned
- test_adapter.py with 28 unit tests for scorer dispatch and aggregation
- .gitignore entries for .cursor/ and REFACTORING.md

Made-with: Cursor
- Treat TaskResult(success=False) as a query failure before scoring; previously failed agent calls bypassed failed_count and ran through normal scorers
- Include query_error in overall_score so failed queries penalize the result instead of being silently excluded
- Escape regex metacharacter in test match= pattern

Made-with: Cursor
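The failure handling described in this commit can be illustrated with a minimal sketch. Everything here is a stand-in rather than the adapter's actual code: the `TaskResult` dataclass, the `score_query` and `overall_score` helpers, and the choice to record a failed query as `query_error = 0.0` and average all metric values are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class TaskResult:
    """Hypothetical stand-in for the adapter's task result type."""
    success: bool
    output: str = ""


def score_query(
    result: TaskResult,
    scorers: dict[str, Callable[[TaskResult], float]],
) -> dict[str, float]:
    """Score one query; a failed task short-circuits before the normal scorers."""
    if not result.success:
        # Assumption: a failed agent call is recorded as a zero-valued query_error metric.
        return {"query_error": 0.0}
    return {name: scorer(result) for name, scorer in scorers.items()}


def overall_score(per_query: list[dict[str, float]]) -> float:
    """Average every metric value, including query_error, so failures lower the score."""
    values = [value for scores in per_query for value in scores.values()]
    return mean(values) if values else 0.0


scorers = {"exact_match": lambda r: 1.0 if r.output == "ok" else 0.0}
results = [TaskResult(success=True, output="ok"), TaskResult(success=False)]
per_query = [score_query(r, scorers) for r in results]
failed_count = sum(1 for r in results if not r.success)
overall = overall_score(per_query)
```

With one passing and one failed query, the failed query is counted in `failed_count` and pulls the overall score down instead of being dropped from the average.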
- Move evalhub stub from test_adapter.py to conftest.py so tests work regardless of collection order or file selection
- Wrap MLflow trace enrichment in its own try/except to prevent flaky MLflow from invalidating successful query results
- Validate missing 'query' key in YAML entries with descriptive ValueError; widen adapter's load_queries catch to include ValueError
- Strip orphan difficulty/category fields from tool_use.yaml
- Add __post_init__ validation for positive timeout/latency values
- Broaden .gitignore REFACTORING.md pattern to **/REFACTORING.md
- Add TODO for asyncio.gather concurrent query execution
- Update README: validation docs, MLflow fault tolerance, sequential execution note, conftest.py stub explanation

Made-with: Cursor
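The two validation changes in this commit (the `__post_init__` check and the missing `'query'` key check) could look roughly like the sketch below. The class name `EvalParams`, its field names, and the `parse_queries` helper are hypothetical; the sketch also validates already-parsed YAML entries (plain dicts) so it does not depend on a YAML library.

```python
from dataclasses import dataclass


@dataclass
class EvalParams:
    """Hypothetical config object; field names are assumptions for illustration."""
    timeout_seconds: float = 60.0
    max_latency_seconds: float = 30.0

    def __post_init__(self) -> None:
        # Reject zero or negative durations at construction time.
        for name in ("timeout_seconds", "max_latency_seconds"):
            value = getattr(self, name)
            if value <= 0:
                raise ValueError(f"{name} must be positive, got {value!r}")


def parse_queries(entries: list[dict]) -> list[str]:
    """Extract the query text from parsed YAML entries, failing with a descriptive error."""
    queries = []
    for index, entry in enumerate(entries):
        if "query" not in entry:
            raise ValueError(f"entry {index} is missing required key 'query'")
        queries.append(entry["query"])
    return queries
```

Raising `ValueError` in both places means a caller that loads queries only needs to catch one exception type for malformed input, which matches the widened catch mentioned in the commit.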
Caution: Review failed. Pull request was closed or merged during review.
📝 Walkthrough
This PR introduces an EvalHub on-cluster adapter that bridges pytest-based behavioral evaluations with EvalHub's Kubernetes orchestration. It includes configuration validation, benchmark management, scoring aggregation, optional MLflow integration, and comprehensive test coverage with stubs for the evalhub package.
Sequence Diagram(s)
sequenceDiagram
participant EvalHub
participant AgenticEvalAdapter
participant BenchmarkMgr as Benchmark Manager
participant QueryLoader as Query Loader
participant Agent as Agent Server
participant Scorers
participant MLflow
EvalHub->>AgenticEvalAdapter: run_benchmark_job(JobSpec)
activate AgenticEvalAdapter
AgenticEvalAdapter->>AgenticEvalAdapter: validate config
AgenticEvalAdapter->>BenchmarkMgr: get_benchmark(benchmark_id)
BenchmarkMgr-->>AgenticEvalAdapter: BenchmarkDef
AgenticEvalAdapter->>QueryLoader: load_queries(benchmark)
QueryLoader-->>AgenticEvalAdapter: list[QuerySpec]
AgenticEvalAdapter->>AgenticEvalAdapter: init MLflow (optional)
loop for each query
AgenticEvalAdapter->>Agent: run_task(query, config)
Agent-->>AgenticEvalAdapter: TaskResult
AgenticEvalAdapter->>Scorers: score_result(task_result, scorers)
loop for each scorer
Scorers-->>AgenticEvalAdapter: Score
end
alt MLflow enabled
AgenticEvalAdapter->>MLflow: log trace enrichment
end
end
AgenticEvalAdapter->>AgenticEvalAdapter: aggregate_scores()
AgenticEvalAdapter->>AgenticEvalAdapter: compute_overall_score()
alt has results
AgenticEvalAdapter->>MLflow: log metrics/params
end
AgenticEvalAdapter-->>EvalHub: JobResults
deactivate AgenticEvalAdapter
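The per-query loop in the diagram, with trace enrichment isolated so that an MLflow error does not discard a scored result, might be sketched as follows. The `run_queries` function and its callable parameters are hypothetical placeholders, not the adapter's real interface, and no MLflow API is called here.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def run_queries(
    queries: list[str],
    run_task: Callable[[str], Any],
    score: Callable[[Any], dict[str, float]],
    enrich_trace: Optional[Callable[[str, dict[str, float]], None]] = None,
) -> list[tuple[str, dict[str, float]]]:
    """Run each query sequentially, score it, and optionally enrich a trace."""
    results = []
    for query in queries:
        task_result = run_task(query)
        scores = score(task_result)
        if enrich_trace is not None:
            try:
                enrich_trace(query, scores)
            except Exception:
                # Tracing is best-effort: log the failure and keep the scored result.
                logger.warning("trace enrichment failed for %r", query, exc_info=True)
        results.append((query, scores))
    return results


def flaky_enrich(query: str, scores: dict[str, float]) -> None:
    raise RuntimeError("tracking server unavailable")


outcome = run_queries(
    ["q1", "q2"],
    run_task=lambda q: q.upper(),
    score=lambda r: {"length": float(len(r))},
    enrich_trace=flaky_enrich,
)
```

Even though every enrichment call raises in this usage example, both queries still produce scores.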
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ✅ 5 passed
Large PR detected (1700 lines changed). This PR exceeds 1200 lines of code changes (excluding lock files, generated content, and images). Large PRs are harder to review thoroughly and are more likely to introduce bugs. Consider splitting this PR into smaller, focused changes.
Description
Adds a pre-commit hook to enforce the Conventional Commits specification on all commit messages. Uses the espressif/conventional-precommit-linter hook, which validates commit message format at the commit-msg stage.
Changes:
- .pre-commit-config.yaml: new file with the conventional commits hook configured with allowed types, subject length limits, and breaking change support
- CONTRIBUTING.md: added "Development setup" section with pre-commit install instructions and updated "Commit message conventions" with the enforced format, all allowed types, and examples
Jira Ticket
RHAIENG-4066
Testing
- make test passes (run from the affected agent directory)
Verified locally:
- Invalid messages ("updated stuff", "chore: doc") are rejected
- Valid messages ("chore: add conventional commits pre-commit hook") pass
Checklist
- No .env or secret files are included in this PR
Review Guidance
- .pre-commit-config.yaml: review the args to confirm the team is happy with the allowed types and subject length limits (min 10, max 72)
  - --scopes: restrict to specific scopes (currently unrestricted)
  - --body-max-line-length: limit body line length (currently unrestricted)
  - --summary-uppercase: enforce uppercase first letter
  - --scope-case-insensitive: allow uppercase in scopes
- … (ruff.toml, .markdownlint.json) are in place.
Related PRs
#80 (Tarun's Ruff linting and formatting: provides ruff.toml that future pre-commit hooks will use)
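To make the enforced format concrete, here is a minimal Python sketch of the kind of check applied to the first line of a commit message. It is not the espressif hook's implementation: the `ALLOWED_TYPES` tuple is an assumed list (the PR's actual allowed types are set in .pre-commit-config.yaml), and the sketch assumes the 10 to 72 character limits apply to the text after the `type: ` prefix.

```python
import re

# Assumption: illustrative type list; the real one lives in .pre-commit-config.yaml.
ALLOWED_TYPES = ("feat", "fix", "docs", "chore", "refactor", "test", "ci")
MIN_SUBJECT, MAX_SUBJECT = 10, 72  # limits quoted in the review guidance

# type, optional (scope), optional "!" for breaking changes, then ": subject"
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^()]+)\))?(?P<breaking>!)?: (?P<subject>.+)$"
)


def is_valid_header(line: str) -> bool:
    """Return True if the first line of a commit message passes the sketched checks."""
    match = HEADER.match(line)
    if match is None or match.group("type") not in ALLOWED_TYPES:
        return False
    return MIN_SUBJECT <= len(match.group("subject")) <= MAX_SUBJECT


examples = {
    message: is_valid_header(message)
    for message in (
        "updated stuff",
        "chore: doc",
        "chore: add conventional commits pre-commit hook",
    )
}
```

Under these assumptions, "updated stuff" fails because it has no type prefix, "chore: doc" fails the minimum subject length, and the PR's own title passes, which matches the local verification described under Testing.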