Context
When we update BirdNET or AVES models, how do we know accuracy hasn't regressed? Build a CI pipeline step that tests new models against a "Golden Dataset" — a curated set of audio recordings with known-correct species labels. If accuracy drops below a threshold, the pipeline blocks the merge.
No regressions in production. Ever.
Architecturally Significant Requirements (ASRs)
Interface (Contract):
- CLI:
orpheus-model-eval --model <path> --dataset <path> --threshold <float> --output <report_path>.
ModelEvaluator class with methods: evaluate(model_path: str, dataset_path: str) -> EvaluationReport.
EvaluationReport dataclass: {"overall_accuracy": float, "per_species": Dict[str, SpeciesMetrics], "passed": bool, "threshold": float}.
SpeciesMetrics dataclass: {"precision": float, "recall": float, "f1": float, "sample_count": int}.
- CI output: Markdown report posted as PR comment.
Implementation (Internal Logic):
- Golden Dataset stored in Git LFS or DVC with manifest file listing audio paths and ground-truth labels.
- Inference runner: load model, process each audio file, collect predictions.
- Metric calculator: per-species precision/recall/F1 plus overall accuracy.
- Regression gate: fail if any species F1 drops more than threshold (default 5%) vs. baseline.
Architectural Constraints
- Golden Dataset must be stored in a reproducible location (Git LFS, S3, or DVC — not checked into the repo directly).
- Evaluation metrics: precision, recall, F1 per species, plus overall accuracy.
- Threshold for blocking must be configurable (default: no species drops more than 5% F1).
- Pipeline must produce a human-readable report (Markdown or HTML) attached to the PR.
- Must be Python 3.9 compatible.
Acceptance Criteria
Feature: Automated Acoustic Regression Testing
Scenario: Model passes regression check
Given a Golden Dataset with 50 labeled recordings
And the baseline model achieves 0.85 F1 for American Crow
When the new model achieves 0.87 F1 for American Crow
Then the regression check passes
And the PR report shows improvement for American Crow
Scenario: Model fails regression check
Given a Golden Dataset with 50 labeled recordings
And the baseline model achieves 0.85 F1 for American Crow
When the new model achieves 0.78 F1 for American Crow (> 5% drop)
Then the regression check fails
And the CI pipeline blocks the merge
And the PR report highlights the regression
Scenario: Generate human-readable report
Given a completed model evaluation
When the report is generated
Then it contains per-species precision, recall, and F1 scores
And it clearly marks regressions vs. improvements
Definition of Done
Context
When we update BirdNET or AVES models, how do we know accuracy hasn't regressed? Build a CI pipeline step that tests new models against a "Golden Dataset" — a curated set of audio recordings with known-correct species labels. If accuracy drops below a threshold, the pipeline blocks the merge.
No regressions in production. Ever.
Architecturally Significant Requirements (ASRs)
Interface (Contract):
orpheus-model-eval --model <path> --dataset <path> --threshold <float> --output <report_path>.ModelEvaluatorclass with methods:evaluate(model_path: str, dataset_path: str) -> EvaluationReport.EvaluationReportdataclass:{"overall_accuracy": float, "per_species": Dict[str, SpeciesMetrics], "passed": bool, "threshold": float}.SpeciesMetricsdataclass:{"precision": float, "recall": float, "f1": float, "sample_count": int}.Implementation (Internal Logic):
Architectural Constraints
Acceptance Criteria
Definition of Done