Skip to content

[MLOPS] Automated Acoustic Regression Testing (Golden Dataset) #37

@scottchronicity

Description

@scottchronicity

Context

When we update BirdNET or AVES models, how do we know accuracy hasn't regressed? Build a CI pipeline step that tests new models against a "Golden Dataset" — a curated set of audio recordings with known-correct species labels. If accuracy drops below a threshold, the pipeline blocks the merge.

No regressions in production. Ever.

Architecturally Significant Requirements (ASRs)

Interface (Contract):

  • CLI: orpheus-model-eval --model <path> --dataset <path> --threshold <float> --output <report_path>.
  • ModelEvaluator class with methods: evaluate(model_path: str, dataset_path: str) -> EvaluationReport.
  • EvaluationReport dataclass: {"overall_accuracy": float, "per_species": Dict[str, SpeciesMetrics], "passed": bool, "threshold": float}.
  • SpeciesMetrics dataclass: {"precision": float, "recall": float, "f1": float, "sample_count": int}.
  • CI output: Markdown report posted as PR comment.

Implementation (Internal Logic):

  • Golden Dataset stored in Git LFS or DVC with manifest file listing audio paths and ground-truth labels.
  • Inference runner: load model, process each audio file, collect predictions.
  • Metric calculator: per-species precision/recall/F1 plus overall accuracy.
  • Regression gate: fail if any species F1 drops more than threshold (default 5%) vs. baseline.

Architectural Constraints

  • Golden Dataset must be stored in a reproducible location (Git LFS, S3, or DVC — not checked into the repo directly).
  • Evaluation metrics: precision, recall, F1 per species, plus overall accuracy.
  • Threshold for blocking must be configurable (default: no species drops more than 5% F1).
  • Pipeline must produce a human-readable report (Markdown or HTML) attached to the PR.
  • Must be Python 3.9 compatible.

Acceptance Criteria

Feature: Automated Acoustic Regression Testing

  Scenario: Model passes regression check
    Given a Golden Dataset with 50 labeled recordings
    And the baseline model achieves 0.85 F1 for American Crow
    When the new model achieves 0.87 F1 for American Crow
    Then the regression check passes
    And the PR report shows improvement for American Crow

  Scenario: Model fails regression check
    Given a Golden Dataset with 50 labeled recordings
    And the baseline model achieves 0.85 F1 for American Crow
    When the new model achieves 0.78 F1 for American Crow (> 5% drop)
    Then the regression check fails
    And the CI pipeline blocks the merge
    And the PR report highlights the regression

  Scenario: Generate human-readable report
    Given a completed model evaluation
    When the report is generated
    Then it contains per-species precision, recall, and F1 scores
    And it clearly marks regressions vs. improvements

Definition of Done

  • Golden Dataset curated with at least 50 labeled recordings covering the top 10 detected species.
  • Evaluation script that runs inference and computes per-species metrics.
  • CI pipeline step that runs on model-update PRs and blocks on regression.
  • Report template showing before/after metrics in the PR comment.

Metadata

Metadata

Assignees

No one assigned

    Labels

    C4: ComponentInternal logical boundaries (Classifiers, Managers)domain: hardwareJetson, GPIO, ALSA, or physical sensorsdomain: mlopsModel evaluation and data pipelinesstaletype: featureNew functionalitytype: infrastructureMakefiles, CI/CD, Docker, K8s

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions