feat: re-add Agora as recommended pipeline, fix Polis repness bugs #99
|
Test fails because best-agree was removed |
|
Hi @patcon, any feedback, so we can merge? |
|
Hey @patcon, could you tell me how to go through to the one statement among all the And if we use (I am a noob when it comes to working with tf data objects) (As I said earlier, it's possible we don't have any best-agree at all with that method if we only have disagree representative opinions, but that seems fine to me?) |
|
I need your help @patcon to understand why there are so many test errors |
a06b6d8 to
c67cb6e
- Remove unused 'stat' import
- Simplify significance checks by removing redundant vote count validations
- Streamline repful_for calculation by removing nested conditionals
- Lower minimum confidence threshold from 0.7 to 0.6 for statement selection
- Improve confidence selection to prefer exact pick_max matches over near-misses
- Remove best-agree flag assignment logic
e6fd28d to
cd8d2e0
The fallback path in select_representative_statements() was returning a raw pandas Series instead of a properly formatted PolisRepnessStatement dict. This caused downstream consumers (e.g. Zod schema validation) to reject the entire math update when any cluster triggered the fallback, because the output had raw column names (na, nd, statement_id) instead of the expected keys (tid, n-success, repful-for).

The fallback now:
- Formats through format_comment_stats() like the non-fallback path
- Handles best_overall=None by producing an empty list instead of [None]
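The shape of that fix can be illustrated with a minimal sketch. It is not the library's code: `format_stats_sketch` and `fallback_statements` are hypothetical names, a plain dict stands in for the pandas Series row, and only the three output keys named in the commit message are produced. The `repful_for` direction is passed in rather than computed, since the real determination is significance-based.

```python
from typing import Optional


def format_stats_sketch(row: dict, repful_for: str) -> dict:
    """Hypothetical stand-in for format_comment_stats().

    Maps raw column names (statement_id, na, nd) to the keys that
    downstream consumers expect (tid, n-success, repful-for).
    """
    n_success = row["na"] if repful_for == "agree" else row["nd"]
    return {
        "tid": row["statement_id"],
        "n-success": n_success,
        "repful-for": repful_for,
    }


def fallback_statements(best_overall: Optional[dict], repful_for: str = "agree") -> list:
    """Return the fallback as a list of formatted dicts.

    An empty list is returned when there is no candidate, so the
    output never contains None or a raw, unformatted row.
    """
    if best_overall is None:
        return []
    return [format_stats_sketch(best_overall, repful_for)]


empty = fallback_statements(None)
formatted = fallback_statements({"statement_id": 7, "na": 3, "nd": 1})
```

With this shape, a schema validator sees either an empty list or entries with the same keys as the non-fallback path.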
The group-aware consensus score used raw p_agree per group in the geometric mean, completely ignoring p_disagree. This meant a group genuinely divided (similar levels of agree and disagree) contributed the same score as an undivided group with the same agree level, allowing divided groups to be masked by other groups' strong agreement.

Replace raw p_agree with "effective agreement": p_agree * (1 - p_disagree). This discounts each group's agreement by its disagreement, so a divided group naturally drags down the consensus score while still producing a continuous ranking.

Also document all divergences from the original Polis algorithm:
- Geometric mean normalization (existing)
- Effective agreement (new)
- Progressive confidence lowering for representative statements (existing)
- Significance-based repful_for determination (existing)
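To make the difference concrete, here is a small sketch comparing the two scores. The function names and the input shape (a list of `(p_agree, p_disagree)` pairs, one per group) are assumptions for illustration and are not taken from the library.

```python
import math


def geometric_mean(values):
    """Geometric mean of a non-empty sequence of non-negative numbers."""
    if not values:
        raise ValueError("need at least one group")
    return math.prod(values) ** (1.0 / len(values))


def raw_agreement_score(groups):
    """Geometric mean of p_agree only; p_disagree is ignored."""
    return geometric_mean([p_agree for p_agree, _ in groups])


def effective_agreement_score(groups):
    """Geometric mean of p_agree * (1 - p_disagree) per group."""
    return geometric_mean([p_agree * (1.0 - p_disagree) for p_agree, p_disagree in groups])


# Two hypothetical conversations with identical agree rates.
# In the second one, the second group is split almost evenly.
undivided = [(0.80, 0.05), (0.50, 0.05)]
divided = [(0.80, 0.05), (0.50, 0.45)]

raw_undivided = raw_agreement_score(undivided)
raw_divided = raw_agreement_score(divided)
eff_undivided = effective_agreement_score(undivided)
eff_divided = effective_agreement_score(divided)
```

The raw score cannot tell the two cases apart because both have the same agree rates, while the effective score ranks the divided case lower.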
## Why this change

The Polis repness algorithm has several issues that motivated both targeted fixes to the Polis implementation and a new Agora implementation that takes a fundamentally different approach.

### Problems with Polis repness selection

The stock Polis `select_representative_statements()` uses a cascade of ad-hoc heuristics: `beats_best_of_agrees()`, then `beats_best_by_repness_test()`, then `beats_best_of_agrees()` again, capped by `pick_max=5`. This approach has three problems:

- It has no statistical foundation: `pick_max=5` is arbitrary and doesn't adapt to data. A conversation with 10 statements and one with 1000 get the same cutoff.
- The `best-agree` heuristic was buggy: it used disagree data for agree statements due to a repful_for calculation error.
- The three-regime cascade makes it impossible to produce a single ranked list of all statements, which is what library users need to build their own UIs and selection logic.

### Polis bug fixes (in select_representative_statements)

These fix the existing Polis implementation directly:

- Fix repful_for calculation to use correct test statistics (pat/pdt)
- Remove buggy best-agree/best-of-agrees heuristic
- Simplify selection cascade
- Fix format_comment_stats to use correct direction fields
- Filter out statements with no agree or disagree votes

### Why we keep both implementations

The Polis implementation stays (with the above fixes) because it serves as a **reference baseline**: users need to verify results match stock Polis behavior. Agora is a separate implementation that shares the same core pipeline (PCA + KMeans) but replaces selection and consensus with principled statistics.

### Why Agora ranks ALL statements then selects

Instead of a black box that returns 5 statements, Agora returns every statement ranked by effect size with a `selected` flag from Benjamini-Hochberg. This lets library consumers:

- Show the full ranking in a UI
- Apply their own selection criteria
- Understand why a statement was or wasn't selected (via adjusted p-values)

### Why Simes' p-value combination (not max, not Fisher)

Each statement has two test statistics: a probability test (is this group's agreement rate significantly above 50%?) and a representativeness test (does this group agree more than others?). We need to combine them into one p-value for BH.

- `max(p_prob, p_rep)` (intersection test): our first attempt. Too conservative: requires BOTH tests independently significant. With m=42 hypotheses and BH fdr=0.10, rank 1 threshold is only 0.0024 (z>=2.81). Result: 0/0/1 statements selected across 3 groups.
- Fisher's method: assumes independence between tests. But our tests are positively dependent (groups that agree more show higher probability AND higher representativeness). Too lenient: 22/0/17.
- Simes' combination `min(2*p_min, p_max)`: valid under positive dependence (Sarkar & Chang 1997). This is exactly our case. Result: 4/0/8 (vs Polis 5/1/5; comparable but data-driven).

### Why effective agreement GAC

Polis GAC uses `prod(pa)^(1/n)`, the geometric mean of agreement rates. A group where 80% agree and 20% disagree scores the same as one where 80% agree and 10% disagree. Agora uses `prod(pa*(1-pd))^(1/n)` to discount agreement by disagreement, penalizing divided groups.

### Why zero-vote filtering (in both Polis and Agora)

Statements with na=0 and nd=0 get p-values from Laplace smoothing noise. In Polis, they're now filtered by `is_statement_significant()`. In Agora, they're excluded from the BH hypothesis count via `apply_bh_with_vote_filter()` to avoid inflating m, but still returned in the ranking with adjusted_p_value=1.0, selected=False.

Closes polis-community#73
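The selection steps described in this commit message can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the function names, the row layout (`na`, `nd`, `p`), and the example p-values are all hypothetical. Fisher's method is written in its closed form for two p-values, where the chi-square survival function with 4 degrees of freedom reduces to `x * (1 - ln x)` with `x = p1 * p2`.

```python
import math


def combine_max(p1, p2):
    """Intersection test: both tests must be significant."""
    return max(p1, p2)


def combine_fisher(p1, p2):
    """Fisher's method for two p-values (assumes independence)."""
    x = p1 * p2
    return x * (1.0 - math.log(x))


def combine_simes(p1, p2):
    """Simes' combination for two p-values."""
    return min(2.0 * min(p1, p2), max(p1, p2))


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [1.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def bh_with_vote_filter_sketch(rows, fdr=0.10):
    """Apply BH only to rows with at least one agree or disagree vote.

    Zero-vote rows stay in the output with adjusted_p_value=1.0 and
    selected=False, and do not count toward m.
    """
    voted = [i for i, r in enumerate(rows) if r["na"] > 0 or r["nd"] > 0]
    adjusted = bh_adjust([rows[i]["p"] for i in voted])
    out = [{**r, "adjusted_p_value": 1.0, "selected": False} for r in rows]
    for i, adj in zip(voted, adjusted):
        out[i]["adjusted_p_value"] = adj
        out[i]["selected"] = adj <= fdr
    return out


# Hypothetical rows: the third has no votes, so its p-value is noise.
rows = [
    {"statement_id": 1, "na": 12, "nd": 2, "p": combine_simes(0.001, 0.20)},
    {"statement_id": 2, "na": 5, "nd": 6, "p": combine_simes(0.40, 0.60)},
    {"statement_id": 3, "na": 0, "nd": 0, "p": 0.03},
]
ranked = bh_with_vote_filter_sketch(rows)
```

In this toy input the zero-vote row would pass BH at fdr=0.10 if it were counted as a hypothesis, which is the kind of noise the filter is meant to keep out. The rank-1 BH threshold is `fdr / m`, which matches the 0.10 / 42 ≈ 0.0024 figure quoted above.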
## Major update: Re-add Agora implementation as recommended pipeline

This PR has grown beyond the original repness bug fixes. It now includes a full Agora implementation that addresses the fundamental limitations of Polis statement selection, plus documentation positioning Agora as the recommended default.

### Why we moved beyond the existing Polis algorithm

The original work on this branch fixed several Polis repness bugs (repful_for using wrong test statistics, best-agree heuristic using disagree data for agree statements, format_comment_stats outputting wrong fields). But as we dug deeper, the problems weren't just bugs; they were structural limitations of the heuristic approach:
### Why keep both implementations?

Polis stays as-is (with targeted bug fixes) because it's the reference baseline. We want users, builders and researchers to be able to verify their results against stock Polis output, as it is the de facto standard, and to compare it with the newer approaches. Agora is separate and shares the same core pipeline (PCA + KMeans) but replaces statement selection and consensus scoring with principled statistics. It also applies bug fixes. More to come.

### The journey to Simes' combination

Agora uses Benjamini-Hochberg FDR control instead of
The key insight: probability and representativeness tests are positively dependent (groups that agree more show higher probability AND higher representativeness). Simes' method is proven valid under positive dependence (Sarkar & Chang 1997), making it the theoretically correct choice. For comparison, Polis selects 5/1/5 on the same data; Agora's 4/0/8 is in the same ballpark but adapts to the data rather than using a fixed cap.

### What's in this commit

Polis fixes (in
Agora implementation (new):
Documentation:
Closes #73 |
Summary
Re-adds the Agora implementation as the recommended default pipeline, with principled statistical methods replacing Polis's ad-hoc heuristics. Also fixes several bugs in the Polis repness selection.
Polis bug fixes (in select_representative_statements)

Agora implementation (new)
- rank_representative_statements(): ranks ALL statements per group by effect size, with BH FDR selection using Simes' p-value combination
- rank_consensus_statements(): ranks ALL statements by pa/pd with BH selection
- compute_effective_agreement_gac(): prod(pa*(1-pd))^(1/n) penalizes divided groups
- apply_bh_with_vote_filter(): shared BH helper excluding zero-vote statements
- AgoraClusteringResult with ranked_repness, ranked_consensus, effective GAC

Documentation
- agora-demo.ipynb notebook as recommended quickstart

Closes #73
Supersedes #105
Original PR description (August 2025)
Fixed:
TODO: