Agent health check#623
Draft
jmelburg wants to merge 20 commits into
Draft
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #623 +/- ##
==========================================
+ Coverage 91.57% 91.76% +0.18%
==========================================
Files 110 111 +1
Lines 9636 10075 +439
==========================================
+ Hits 8824 9245 +421
- Misses 812 830 +18 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Collaborator
|
@jmelburg are we still using the create_metric_itable method? I had introduced it to behave similar to the create_metric_gttable that we use in the health checks. |
Collaborator
Author
|
@operdeck We use it for DSI, I customized it a bit when upgrading DSI but it should still have the same formatting and thresholds as health check |
42c4697 to
1523d07
Compare
- Reports.health_check_agent: PathLike (was undefined os.PathLike) - test_Analysis: zip(..., strict=True) for B905 - test_Reports.py: trailing newline cleanup from end-of-file-fixer Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Analysis is an ADMDatamart namespace, so tests/adm/ is the right home (matches tests/adm/test_Aggregates.py for the sibling namespace). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per AGENTS.md: `verbose` is reserved for genuine user-facing progress (tqdm bars, multi-stage CLI). Internal path echoes are diagnostic detail and belong on the logger. `Reports.health_check` was already using `logger.info` on its failure path; this aligns `health_check_agent` with that pattern. Side effects: - Drops two latent run_quarto kwargs that don't exist on the function (`verbose=`, `size_reduction_method=`). Tests are Quarto-gated and marked slow, so the call would have crashed at first real invocation. - CHANGELOG entry under [5.0.0] Changed documenting the breaking removal. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per AGENTS.md "exact-value assertions over structural checks". Six
tests in test_Analysis.py used `> 0` / `len(...) > 0` to assert the
shape of finding output. All replaced with exact counts derived from
the fixture data (cdh_sample or _make_dm), with comments noting the
re-pin trigger if upstream sample data changes.
- TestFindings::test_findings_have_nonempty_title_and_detail: also
asserts at least one finding exists (previously vacuously passed
on an empty list).
- TestModelMaturityChecks: 3 maturity findings on cdh_sample.
- TestPredictorChecks: 2 predictor findings on cdh_sample.
- TestFullAnalysis: 10 total findings, exact category set
{model, taxonomy, predictor}.
- TestCheckTaxonomyEdgeCases (RED/AMBER): 4 findings each (one per
metric: actions, treatments, issues, channels) plus title check.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three plan files capturing follow-ups identified in the PR #623 architecture-review pass that were too invasive for a single review-pass PR (per the scope-bail rule in the review brief): - adm/analysis-namespace-facade.md (P3): `Analysis` has 9 private helpers but only 1 public method. AGENTS.md namespace-facade trigger (>20 public methods) isn't met. Split would be ~4h of mechanical work + 73 test-mock-path rewrites for pure dotted-name benefit. - adm/analysis-aggregates-overlap-audit.md (P3): per-method audit result. Most checks already delegate to dm.aggregates; the only borderline duplicate (_check_trends per-snapshot per-channel weighted avg) has no existing aggregate to delegate to. No code changes recommended at this time. - health-check/agent-report-qmd-dedup.md (P3): HealthCheck.qmd vs HealthCheckAgent.qmd share substantial preamble. Quarto restructure is risky (slow Quarto-gated tests; subtle include semantics) — split into a focused PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1523d07 to
38ccb9a
Compare
Validated against 8 production CDH datasets (Rabobank, Wells Fargo, Achmea, Bank of Ireland, Verizon, Standard Chartered, T-Mobile US, Singtel). Output went from 32-39 findings down to 18-25 with the following changes: - Collapse N x 'Channel X has zero responses' into one aggregate finding per direction with the channel list truncated at 5 (Verizon: 16 dead-channel lines -> 2; T-Mobile: 14 -> 2). - Add a 'data_quality' category and _check_data_quality helper that flags malformed Channel/Direction values (empty, '/', '/Inbound', 'Web/'). Invalid channels are excluded from the dead-channel aggregate so the counts stay clean. - Drop the per-dead-channel 'has N model configurations' info loop in _check_channels - it added zero actionable detail and only ever fired for already-dead channels. - Add a 'summary' category with a �� stethoscope icon and a single headline finding at index 0 derived from the underlying check output (% never used, dead-channel count, weighted average AUC). Verdict thresholds: NEEDS ATTENTION (>25% never-used / >2 dead channels with mature low-perf / AUC<55), MIXED, HEALTHY. - Tier the overall-AUC finding: <58 critical, 58-62 warning, 62-70 info, ≥70 success (new severity with ✅ icon). - Add a 'mature_low_perf' percentage bucket so the five maturity bins (never_used / responses_no_positives / immature / mature_low_perf / mature_decent) are mutually exclusive and sum to exactly 100% on every real dataset. Use bp_min ModelPerformance (0.55) as the cutoff so 'healthy' and 'low' agree. - Surface explicit thresholds and direction in taxonomy findings: 'Total treatments (1) is below minimum (2)', 'Total actions (7,059) is above typical range (>2500)' instead of the generic 'outside recommended limits'. - Append the dominant Channel/Direction to 'Configuration X has no positives' findings so triagers know which channel to investigate. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Added exact-value tests in test_Analysis.py for: - _check_data_quality detection of empty / '/' / '/Inbound' values and exact model count (3 invalid out of 4 rows) - Collapsed dead-channel findings with singular/plural noun and '+N more' truncation past 5 channels - Headline summary finding (HEALTHY / MIXED / NEEDS ATTENTION) with 🩺 icon - Tiered AUC: critical < 58, warning 58-62, info 62-70, success ≥ 70 - Maturity buckets summing exactly to total model count - Taxonomy 'below minimum' / 'above typical range' direction in title - Configuration channel hint '(Push/Outbound)' on no-positives finding Updated existing tests for: - New 'success' severity in valid_severities - New 'data_quality' / 'summary' categories in valid_categories - _check_model_performance now returns (findings, avg_auc) tuple - _check_channels no longer emits 'has N model configurations' - Pinned counts: maturity 3 -> 4, full findings 10 -> 12 - Taxonomy title format check uses new direction-aware strings Deferred to plan files: - docs/plans/adm/findings-severity-ranking.md (issue #10) - docs/plans/adm/findings-recommendations.md (issue #8) 89 tests pass; ruff and pre-commit clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The visible maturity buckets are intentionally a 'visible cohorts' view,
not a 100 % decomposition. Mature low-performance models are surfaced
through dedicated count findings ('N mature models have low performance',
'N stuck at AUC=50') rather than a percentage bucket — surfacing them
twice produced duplicate signal in the agent output.
Restores the 'decent' cutoff to min_perf (0.52) so the bucket boundary
matches the AUC threshold used elsewhere in the report.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The architecture pass moved test_Reports.py into python/tests/healthcheck/ but the workflow step still pointed at the old top-level path, so the agent-report job collected zero tests and failed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Agent-ready markdown version of the health check, to be added to the MCP