[MLOPS] Automated Acoustic Regression Testing (Golden Dataset)

### Context

When we update BirdNET or AVES models, how do we know accuracy hasn't regressed? Build a CI pipeline step that tests new models against a "Golden Dataset" — a curated set of audio recordings with known-correct species labels. If accuracy drops below a threshold, the pipeline blocks the merge.

No regressions in production. Ever.

### Architecturally Significant Requirements (ASRs)

**Interface (Contract):**
- CLI: `orpheus-model-eval --model <path> --dataset <path> --threshold <float> --output <report_path>`.
- `ModelEvaluator` class with methods: `evaluate(model_path: str, dataset_path: str) -> EvaluationReport`.
- `EvaluationReport` dataclass: `{"overall_accuracy": float, "per_species": Dict[str, SpeciesMetrics], "passed": bool, "threshold": float}`.
- `SpeciesMetrics` dataclass: `{"precision": float, "recall": float, "f1": float, "sample_count": int}`.
- CI output: Markdown report posted as PR comment.

**Implementation (Internal Logic):**
- Golden Dataset stored in Git LFS or DVC with manifest file listing audio paths and ground-truth labels.
- Inference runner: load model, process each audio file, collect predictions.
- Metric calculator: per-species precision/recall/F1 plus overall accuracy.
- Regression gate: fail if any species F1 drops more than threshold (default 5%) vs. baseline.

### Architectural Constraints

- Golden Dataset must be stored in a reproducible location (Git LFS, S3, or DVC — not checked into the repo directly).
- Evaluation metrics: precision, recall, F1 per species, plus overall accuracy.
- Threshold for blocking must be configurable (default: no species drops more than 5% F1).
- Pipeline must produce a human-readable report (Markdown or HTML) attached to the PR.
- Must be Python 3.9 compatible.

### Acceptance Criteria

```gherkin
Feature: Automated Acoustic Regression Testing

  Scenario: Model passes regression check
    Given a Golden Dataset with 50 labeled recordings
    And the baseline model achieves 0.85 F1 for American Crow
    When the new model achieves 0.87 F1 for American Crow
    Then the regression check passes
    And the PR report shows improvement for American Crow

  Scenario: Model fails regression check
    Given a Golden Dataset with 50 labeled recordings
    And the baseline model achieves 0.85 F1 for American Crow
    When the new model achieves 0.78 F1 for American Crow (> 5% drop)
    Then the regression check fails
    And the CI pipeline blocks the merge
    And the PR report highlights the regression

  Scenario: Generate human-readable report
    Given a completed model evaluation
    When the report is generated
    Then it contains per-species precision, recall, and F1 scores
    And it clearly marks regressions vs. improvements
```

### Definition of Done

- [ ] Golden Dataset curated with at least 50 labeled recordings covering the top 10 detected species.
- [ ] Evaluation script that runs inference and computes per-species metrics.
- [ ] CI pipeline step that runs on model-update PRs and blocks on regression.
- [ ] Report template showing before/after metrics in the PR comment.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[MLOPS] Automated Acoustic Regression Testing (Golden Dataset) #37

Context

Architecturally Significant Requirements (ASRs)

Architectural Constraints

Acceptance Criteria

Definition of Done

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

[MLOPS] Automated Acoustic Regression Testing (Golden Dataset) #37

Description

Context

Architecturally Significant Requirements (ASRs)

Architectural Constraints

Acceptance Criteria

Definition of Done

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions