
feat: re-add Agora as recommended pipeline, fix Polis repness bugs#99

Merged
nicobao merged 25 commits into
polis-community:mainfrom
nicobao:fix-repful-for
Mar 12, 2026

Conversation

@nicobao
Member

@nicobao nicobao commented Aug 20, 2025

Updated 2026-03-12: This PR expanded from Polis repness bug fixes into the full Agora pipeline implementation. See this comment for the full technical rationale.

Summary

Re-adds the Agora implementation as the recommended default pipeline, with principled statistical methods replacing Polis's ad-hoc heuristics. Also fixes several bugs in the Polis repness selection.

Polis bug fixes (in select_representative_statements)

  • Fix repful_for calculation to use correct test statistics (pat/pdt)
  • Remove buggy best-agree/best-of-agrees heuristic (was using disagree data for agree statements)
  • Fix format_comment_stats to output correct direction fields
  • Filter out zero-vote statements from significance testing

Agora implementation (new)

  • rank_representative_statements(): ranks ALL statements per group by effect size, with BH FDR selection using Simes' p-value combination
  • rank_consensus_statements(): ranks ALL statements by pa/pd with BH selection
  • compute_effective_agreement_gac(): prod(pa*(1-pd))^(1/n) penalizes divided groups
  • apply_bh_with_vote_filter(): shared BH helper excluding zero-vote statements
  • AgoraClusteringResult with ranked_repness, ranked_consensus, effective GAC

Documentation

  • README: Agora as recommended default, Polis vs Agora comparison table, roadmap
  • API reference: Agora functions and types, fix base section bug
  • CHANGELOG: all additions documented
  • agora-demo.ipynb notebook as recommended quickstart
  • Cram snapshot tests with Polis vs Agora selection comparison

Closes #73
Supersedes #105

Original PR description (August 2025)

Fixed:

  • Representative opinions used to be formatted incorrectly (for example, using repful_for=disagree data instead of agree data), especially for "best-agree".
  • Representative opinions were selected one way and then picked for formatting another way, which led to errors.
  • Representative opinions are now sorted by repness-test before pick-max is applied, so we're sure we keep the best ones.

TODO:

  • best-agree support has been temporarily removed because the implementation was flawed
  • instead of the previous implementation, we should pick the best "agree" among the representative opinions that remain after the pick_max filter, if any, and then update that row to add "best-agree: true". There may be no best-agree when there are no "agree" representative opinions. That is expected: in my experience there is sometimes no best-agree at all (it also makes sense in general I think, but I may be wrong)
  • add unit tests!

@nicobao
Member Author

nicobao commented Aug 20, 2025

Test fails because best-agree was removed

@nicobao nicobao changed the title fix: representative opinions selection and stat formatting fix: representative opinions selection and comment stats formatting Aug 20, 2025
@nicobao
Member Author

nicobao commented Aug 25, 2025

Hi @patcon, any feedback, so we can merge?

@nicobao
Member Author

nicobao commented Sep 5, 2025

Hey @patcon, could you tell me how to go through sufficient_statements to add the following:

best_agree.update({"n-agree": best_agree["n-success"], "best-agree": True})

to the one statement in sufficient_statements that has the maximum repness_test among those with repful_for=agree (if any)?

And if we use best_overall (sufficient_statements is empty) then do the same for best_overall.

(I am a noob when it comes to working with tf data objects)

(As I said earlier, it's possible we don't have any best-agree at all with that method if we only have disagree representative opinions, but that seems fine to me?)
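One possible way to do what is asked above, as a minimal sketch rather than the library's code. It assumes the selected statements are available as a list of plain dicts; the `repness-test` key name and the `flag_best_agree` helper are hypothetical, while `repful-for`, `n-success` and `tid` are the output keys mentioned elsewhere in this PR.

```python
from typing import Optional


def flag_best_agree(statements: list) -> Optional[dict]:
    """Mark the strongest 'agree' statement in place; return it, or None."""
    agrees = [s for s in statements if s.get("repful-for") == "agree"]
    if not agrees:
        # Only 'disagree' statements were selected: no best-agree, by design.
        return None
    # "repness-test" is an assumed key name for the repness test statistic.
    best = max(agrees, key=lambda s: s["repness-test"])
    best.update({"n-agree": best["n-success"], "best-agree": True})
    return best


# Illustrative data only, not taken from a real conversation.
selected = [
    {"tid": 1, "repful-for": "disagree", "repness-test": 3.1, "n-success": 12},
    {"tid": 2, "repful-for": "agree", "repness-test": 2.4, "n-success": 9},
    {"tid": 3, "repful-for": "agree", "repness-test": 2.9, "n-success": 15},
]
best = flag_best_agree(selected)
```

When sufficient_statements is empty, the same helper could be called on a one-element list holding best_overall. If the statements live in a DataFrame, converting the rows to dicts first would let this sketch apply unchanged.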

@nicobao
Member Author

nicobao commented Sep 23, 2025

I need your help @patcon to understand why there are so many test errors

- Remove unused 'stat' import
- Simplify significance checks by removing redundant vote count validations
- Streamline repful_for calculation by removing nested conditionals
- Lower minimum confidence threshold from 0.7 to 0.6 for statement selection
- Improve confidence selection to prefer exact pick_max matches over near-misses
- Remove best-agree flag assignment logic
nicobao added 3 commits March 3, 2026 23:06
The fallback path in select_representative_statements() was returning
a raw pandas Series instead of a properly formatted PolisRepnessStatement
dict. This caused downstream consumers (e.g. Zod schema validation) to
reject the entire math update when any cluster triggered the fallback,
because the output had raw column names (na, nd, statement_id) instead
of the expected keys (tid, n-success, repful-for).

The fallback now:
- Formats through format_comment_stats() like the non-fallback path
- Handles best_overall=None by producing an empty list instead of [None]
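
The fallback behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: `format_row` is a hypothetical stand-in for format_comment_stats() (whose real signature is not shown here), and the raw `repful_for` column name is assumed.

```python
from typing import Any, Callable, Mapping, Optional


def format_row(row: Mapping[str, Any]) -> dict:
    """Hypothetical stand-in for format_comment_stats(): rename raw columns."""
    agree = row["repful_for"] == "agree"
    return {
        "tid": row["statement_id"],
        "n-success": row["na"] if agree else row["nd"],
        "repful-for": row["repful_for"],
    }


def format_fallback(
    best_overall: Optional[Mapping[str, Any]],
    formatter: Callable[[Mapping[str, Any]], dict] = format_row,
) -> list:
    """Return a list of formatted dicts; never a raw row and never [None]."""
    if best_overall is None:
        return []
    return [formatter(best_overall)]


raw = {"statement_id": 7, "na": 4, "nd": 1, "repful_for": "agree"}
formatted = format_fallback(raw)
empty = format_fallback(None)
```

Routing the fallback through the same formatter as the main path means a downstream schema check sees one shape regardless of which path produced the statement.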
The group-aware consensus score used raw p_agree per group in the
geometric mean, completely ignoring p_disagree. This meant a group
genuinely divided (similar levels of agree and disagree) contributed
the same score as an undivided group with the same agree level,
allowing divided groups to be masked by other groups' strong agreement.

Replace raw p_agree with "effective agreement": p_agree * (1 - p_disagree).
This discounts each group's agreement by its disagreement, so a divided
group naturally drags down the consensus score while still producing a
continuous ranking.

Also document all divergences from the original Polis algorithm:
- Geometric mean normalization (existing)
- Effective agreement (new)
- Progressive confidence lowering for representative statements (existing)
- Significance-based repful_for determination (existing)

## Why this change

The Polis repness algorithm has several issues that motivated both
targeted fixes to the Polis implementation and a new Agora
implementation that takes a fundamentally different approach.

### Problems with Polis repness selection

The stock Polis `select_representative_statements()` uses a cascade of
ad-hoc heuristics: `beats_best_of_agrees()`, then
`beats_best_by_repness_test()`, then `beats_best_of_agrees()` again,
capped by `pick_max=5`. This approach has three problems:

- It has no statistical foundation: `pick_max=5` is arbitrary and
  doesn't adapt to data. A conversation with 10 statements and one with
  1000 get the same cutoff.
- The `best-agree` heuristic was buggy: it used disagree data for agree
  statements due to a repful_for calculation error.
- The three-regime cascade makes it impossible to produce a single
  ranked list of all statements, which is what library users need to
  build their own UIs and selection logic.

### Polis bug fixes (in select_representative_statements)

These fix the existing Polis implementation directly:
- Fix repful_for calculation to use correct test statistics (pat/pdt)
- Remove buggy best-agree/best-of-agrees heuristic
- Simplify selection cascade
- Fix format_comment_stats to use correct direction fields
- Filter out statements with no agree or disagree votes

### Why we keep both implementations

The Polis implementation stays (with the above fixes) because it serves
as a **reference baseline** — users need to verify results match stock
Polis behavior. Agora is a separate implementation that shares the same
core pipeline (PCA + KMeans) but replaces selection and consensus with
principled statistics.

### Why Agora ranks ALL statements then selects

Instead of a black-box that returns 5 statements, Agora returns every
statement ranked by effect size with a `selected` flag from
Benjamini-Hochberg. This lets library consumers:
- Show the full ranking in a UI
- Apply their own selection criteria
- Understand why a statement was or wasn't selected (via adjusted
  p-values)
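
The rank-everything-then-flag idea can be sketched with a plain Benjamini-Hochberg adjustment. This is a minimal illustration, not the code of `rank_representative_statements()`: the helper names, the `effect_size` and `p_value` keys, and the sample numbers are assumptions; `adjusted_p_value` and `selected` follow the field names used in this message.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [1.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def rank_and_select(statements, fdr=0.10):
    """Return every statement, sorted by effect size, with a BH flag."""
    adjusted = bh_adjust([s["p_value"] for s in statements])
    ranked = [
        {**s, "adjusted_p_value": adj, "selected": adj <= fdr}
        for s, adj in zip(statements, adjusted)
    ]
    return sorted(ranked, key=lambda s: s["effect_size"], reverse=True)


# Illustrative numbers only.
statements = [
    {"statement_id": 0, "effect_size": 0.42, "p_value": 0.001},
    {"statement_id": 1, "effect_size": 0.10, "p_value": 0.04},
    {"statement_id": 2, "effect_size": 0.25, "p_value": 0.03},
    {"statement_id": 3, "effect_size": 0.02, "p_value": 0.5},
]
ranking = rank_and_select(statements)
```

Nothing is dropped from `ranking`: a consumer can show all four rows, filter on `selected`, or apply a different cutoff to `adjusted_p_value`.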

### Why Simes' p-value combination (not max, not Fisher)

Each statement has two test statistics: a probability test (is this
group's agreement rate significantly above 50%?) and a
representativeness test (does this group agree more than others?).
We need to combine them into one p-value for BH.

- `max(p_prob, p_rep)` (intersection test) — our first attempt. Too
  conservative: requires BOTH tests independently significant. With
  m=42 hypotheses and BH fdr=0.10, rank 1 threshold is only 0.0024
  (z>=2.81). Result: 0/0/1 statements selected across 3 groups.
- Fisher's method — assumes independence between tests. But our tests
  are positively dependent (groups that agree more show higher
  probability AND higher representativeness). Too lenient: 22/0/17.
- Simes' combination `min(2*p_min, p_max)` — valid under positive
  dependence (Sarkar & Chang 1997). This is exactly our case. Result:
  4/0/8 (vs Polis 5/1/5 — comparable but data-driven).
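
For reference, the three combination rules compared above can be written down for the two-p-value case. This is a sketch with hypothetical function names and illustrative inputs; it does not reproduce the 0/0/1, 22/0/17 or 4/0/8 counts, which depend on the real data. Fisher's statistic with two p-values follows a chi-squared distribution with 4 degrees of freedom, whose survival function has the closed form used below.

```python
import math


def combine_max(p_prob, p_rep):
    """Intersection test: both tests must be significant on their own."""
    return max(p_prob, p_rep)


def combine_fisher(p_prob, p_rep):
    """Fisher's method for two p-values (assumes independence).

    X = -2 * (ln p1 + ln p2) is chi-squared with 4 degrees of freedom, and
    P(chi2_4 > x) = exp(-x / 2) * (1 + x / 2).
    """
    t = -(math.log(p_prob) + math.log(p_rep))  # t = X / 2
    return math.exp(-t) * (1.0 + t)


def combine_simes(p_prob, p_rep):
    """Simes' combination for two p-values: min(2 * p_min, p_max)."""
    return min(2.0 * min(p_prob, p_rep), max(p_prob, p_rep))


# BH threshold for the smallest p-value with m=42 hypotheses at fdr=0.10.
rank1_threshold = 0.10 * 1 / 42

# Illustrative pair: strong probability test, weaker representativeness test.
p_prob, p_rep = 0.001, 0.04
combined = {
    "max": combine_max(p_prob, p_rep),
    "fisher": combine_fisher(p_prob, p_rep),
    "simes": combine_simes(p_prob, p_rep),
}
```

For this pair Simes' value falls between Fisher's and the maximum. Simes' value can never exceed the maximum, but its order relative to Fisher's depends on the inputs.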

### Why effective agreement GAC

Polis GAC uses `prod(pa)^(1/n)` — geometric mean of agreement rates.
A group where 80% agree AND 60% disagree scores the same as one where
80% agree and 10% disagree. Agora uses `prod(pa*(1-pd))^(1/n)` to
discount agreement by disagreement, penalizing divided groups.
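
The two formulas can be compared side by side. This is a minimal sketch, not `compute_effective_agreement_gac()` itself: the function names are hypothetical, each group is given as an already-computed `(pa, pd)` pair, and the rates are chosen only for illustration.

```python
import math


def gac_polis(groups):
    """Geometric mean of per-group agreement rates: prod(pa)^(1/n)."""
    return math.prod(pa for pa, _ in groups) ** (1 / len(groups))


def gac_effective(groups):
    """Geometric mean of pa * (1 - pd), discounting agreement by disagreement."""
    return math.prod(pa * (1 - pd) for pa, pd in groups) ** (1 / len(groups))


# Two conversations with identical agreement rates; only the second group's
# disagreement rate differs.
undivided = [(0.8, 0.1), (0.5, 0.05)]
divided = [(0.8, 0.1), (0.5, 0.45)]

polis_scores = (gac_polis(undivided), gac_polis(divided))
effective_scores = (gac_effective(undivided), gac_effective(divided))
```

The first formula cannot tell the two cases apart because `pd` never enters it, while the second gives the divided case a strictly lower score.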

### Why zero-vote filtering (in both Polis and Agora)

Statements with na=0 and nd=0 get p-values from Laplace smoothing
noise. In Polis, they're now filtered by `is_statement_significant()`.
In Agora, they're excluded from BH hypothesis count via
`apply_bh_with_vote_filter()` to avoid inflating m, but still returned
in the ranking with adjusted_p_value=1.0, selected=False.
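
The effect of excluding zero-vote statements from the hypothesis count can be sketched like this. It is an illustration, not the body of `apply_bh_with_vote_filter()`: the function name, the row layout (`na`, `nd`, `p_value`) and the numbers are assumptions.

```python
def bh_with_vote_filter_sketch(rows, fdr=0.10):
    """BH over voted statements only; zero-vote rows stay in the output."""
    voted = [i for i, r in enumerate(rows) if r["na"] + r["nd"] > 0]
    m = len(voted)  # zero-vote statements do not count as hypotheses
    # Default every row to "not selected"; voted rows are overwritten below.
    out = [{**r, "adjusted_p_value": 1.0, "selected": False} for r in rows]
    by_p = sorted(voted, key=lambda i: rows[i]["p_value"])
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = by_p[rank - 1]
        running_min = min(running_min, rows[i]["p_value"] * m / rank)
        out[i]["adjusted_p_value"] = running_min
        out[i]["selected"] = running_min <= fdr
    return out


# Illustrative rows: the middle statement received no agree/disagree votes.
rows = [
    {"statement_id": 0, "na": 5, "nd": 1, "p_value": 0.04},
    {"statement_id": 1, "na": 0, "nd": 0, "p_value": 0.30},
    {"statement_id": 2, "na": 3, "nd": 4, "p_value": 0.50},
]
result = bh_with_vote_filter_sketch(rows)
```

With m=2 the first statement's adjusted p-value is 0.04 * 2 / 1 = 0.08 and it is selected at fdr=0.10. Counting the zero-vote row would give m=3 and 0.04 * 3 / 1 = 0.12, which misses the cutoff.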

Closes polis-community#73
@nicobao
Member Author

nicobao commented Mar 12, 2026

Major update: Re-add Agora implementation as recommended pipeline

This PR has grown beyond the original repness bug fixes. It now includes a full Agora implementation that addresses the fundamental limitations of Polis statement selection, plus documentation positioning Agora as the recommended default.

Why we moved beyond the existing Polis algorithm

The original work on this branch fixed several Polis repness bugs (repful_for using wrong test statistics, best-agree heuristic using disagree data for agree statements, format_comment_stats outputting wrong fields). But as we dug deeper, the problems weren't just bugs — they were structural limitations of the heuristic approach:

  1. pick_max=5 is arbitrary. A conversation with 10 statements and one with 1000 get the same cutoff. There's no statistical basis for "5".

  2. The three-regime cascade (beats_best_of_agrees → beats_best_by_repness_test → beats_best_of_agrees) is opaque. It's impossible to extract a single ranked list from it. Library users who want to build their own UIs or apply custom selection criteria are stuck with a black box that returns 5 statements.

  3. The GAC formula prod(pa)^(1/n) ignores disagreement. A group split 80% agree / 60% disagree gets the same consensus score as 80% agree / 10% disagree. Divided groups should score lower.

Why keep both implementations?

Polis stays as-is (with targeted bug fixes) because it's the reference baseline. We want users, builders and researchers to be able to verify their results against stock Polis output, as it is the de facto standard, and to compare it with the newer approaches.

Agora is separate and shares the same core pipeline (PCA + KMeans) but replaces statement selection and consensus scoring with principled statistics. It also applies bug fixes. More to come.

The journey to Simes' combination

Agora uses Benjamini-Hochberg FDR control instead of pick_max. Each statement has two test statistics (probability test + representativeness test) that need combining into one p-value. We explored three approaches:

| Method | Assumption | Result (3 groups) | Verdict |
| --- | --- | --- | --- |
| max(p_prob, p_rep) (intersection) | None | 0/0/1 selected | Too conservative: requires BOTH tests independently significant |
| Fisher's χ² = -2Σln(p) | Independence | 22/0/17 selected | Too lenient: our tests aren't independent |
| Simes' min(2·p_min, p_max) | Positive dependence | 4/0/8 selected | Just right: valid for our case |

The key insight: probability and representativeness tests are positively dependent (groups that agree more show higher probability AND higher representativeness). Simes' method is proven valid under positive dependence (Sarkar & Chang 1997), making it the theoretically correct choice.

For comparison, Polis selects 5/1/5 on the same data — Agora's 4/0/8 is in the same ballpark but adapts to the data rather than using a fixed cap.

What's in this commit

Polis fixes (in select_representative_statements):

  • Fix repful_for to use correct test statistics (pat/pdt)
  • Remove buggy best-agree/best-of-agrees heuristic
  • Fix format_comment_stats to use correct direction fields
  • Filter out zero-vote statements from significance testing

Agora implementation (new):

  • rank_representative_statements(): ranks ALL statements by effect size, BH FDR selection with Simes' p-value combination
  • rank_consensus_statements(): ranks ALL statements by pa/pd, BH FDR selection
  • compute_effective_agreement_gac(): prod(pa*(1-pd))^(1/n) penalizes divided groups
  • apply_bh_with_vote_filter(): shared BH helper excluding zero-vote statements
  • AgoraClusteringResult dataclass with ranked_repness, ranked_consensus, effective GAC

Documentation:

  • README updated: Agora as recommended default, Polis vs Agora comparison table
  • API reference: Agora functions and types documented
  • CHANGELOG: full list of additions
  • agora-demo.ipynb notebook as recommended quickstart
  • Cram snapshot tests for Agora pipeline output with Polis comparison

Closes #73

@nicobao nicobao changed the title fix: representative opinions selection and comment stats formatting feat: re-add Agora as recommended pipeline, fix Polis repness bugs Mar 12, 2026
@nicobao nicobao merged commit 8bd5881 into polis-community:main Mar 12, 2026
8 checks passed
@nicobao nicobao deleted the fix-repful-for branch March 12, 2026 15:51

Development

Successfully merging this pull request may close these issues.

feat: Agora pipeline — rank all statements with BH selection, fix Polis repness bugs