Commit 8d4338f

📝 Refactor pairwise comparison

1 parent 3574691 commit 8d4338f

3 files changed (+29 / -30 lines)

Deep Funding.md

Lines changed: 1 addition & 22 deletions
@@ -94,31 +94,10 @@ Once the competition ends, extra comparisons could be gathered for projects that
 
 - Set a consensus over which meta-mechanism is used to evaluate weights (e.g., Brier score). Judge/rank mechanisms and models solely on their performance against the rigorous pre-built eval set. No subjective opinions, just a leaderboard of the most aligned weight distributions.
 - Win rates can be derived from pairwise comparisons.
-- No intensity, just more good old pairwise comparisons!
-- Intensity [requires global knowledge](https://xkcd.com/883/), has interpersonal scales, and humans are incoherent when assigning intensities (even within the same order of magnitude).
-- Make it easy and smooth for people to make their comparisons. Use LLM suggestions, good UX with details, remove any friction, and gather as many comparisons as possible. Filter after the fact using heuristics or something simpler like a whitelist. If there is a test set (labels from people the org trusts), evaluate against it to choose the best labelers.
-- Fields that use pairwise comparisons:
-  - Psychology (psychometrics), trying to predict the latent utilities of items
-  - Consumer science, doing preference testing
-  - Decision analysis
-  - Marketing (also with top-k or best-worst methods)
-  - Recommendation systems
-  - Sports (Elo)
-  - RLHF
-- Pairwise comparisons make choices a simple decision (yes/no, this or that). No one knows what "3.4x better" means.
-- Occam's razor works here too: simple things generalize better.
-- Intensity makes the distribution curve arbitrary.
-- We should test the assumption that expert jurors give good results. Jurors are messy and not well calibrated. Collecting more information from "expert" jurors will probably add more noise. We should instead assume noisy jurors and use techniques to deal with that.
-- There are better and more modern methods to derive weights from [noisy pairwise comparisons](https://arxiv.org/abs/2510.09333) ([from multiple annotators](https://arxiv.org/abs/1612.04413)).
-- [Detect and correct for evaluators' bias in the task of ranking items from pairwise comparisons](https://link.springer.com/article/10.1007/s10618-024-01024-z).
-- Use active ranking or dueling bandits to [speed up the data-gathering process](https://projecteuclid.org/journals/annals-of-statistics/volume-47/issue-6/Active-ranking-from-pairwise-comparisons-and-when-parametric-assumptions-do/10.1214/18-AOS1772.pdf).
-- Stop with a "budget stability" rule (the expected absolute dollar change from one more batch of comparisons is less than a threshold).
+- Lean on the [[Pairwise Comparisons]] playbook (binary questions over intensity, active sampling, filtering noisy raters) for any human labeling.
 - Do some post-processing to the weights:
   - Report accuracy/Brier and use a paired bootstrap to see if the gap is statistically meaningful (a sketch follows this hunk).
   - If gaps are not statistically meaningful, bucket rewards (using Zipf's law) so it feels fair.
-- If anyone can rate (or jury selection is more relaxed), you can remove low-quality raters with heuristics or keep only the best N raters (crowd BT).
-- Crowdsourced annotators are often unreliable; [integrating multiple noisy labels to produce accurate annotations is arguably the most important consideration when designing and implementing a crowdsourcing system](https://arxiv.org/pdf/2407.06902).
-- To gather more comparisons, a top-k method could be used instead of pairwise: show 6 projects and ask for the top 3 (no need to order them).
 - How would things look if we used [Bayesian Bradley-Terry](https://erichorvitz.com/crowd_pairwise.pdf) instead of [classic Bradley-Terry](https://gwern.net/resorter)? Since comparisons are noisy and jurors are unreliable, can we [compute distributions instead of "skills"](https://github.com/max-niederman/fullrank)?
 - Instead of one canonical graph, allow different stakeholder groups (developers, funders, users) to maintain their own weight overlays on the same edge structure. Aggregate these views using quadratic or other mechanisms.
 - If there is a plurality of these "dependency graphs" (or just different sets of weights), the funding organization can choose which one to use! The curators gain a % of the money for their service. This creates a market-like mechanism that incentivizes useful curation.
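The retained "report accuracy/Brier and use a paired bootstrap" step can be made concrete. The sketch below is a minimal illustration on toy data; the mechanism predictions, sample size, and names are hypothetical, not code from this repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def brier(pred, outcome):
    # Mean squared error between predicted win probabilities and observed 0/1 outcomes.
    return np.mean((pred - outcome) ** 2)

def paired_bootstrap_gap(pred_a, pred_b, outcome, n_boot=10_000):
    """Bootstrap the Brier-score gap between two weight mechanisms on the same eval set."""
    n = len(outcome)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)  # resample the same comparisons for both mechanisms
        gaps[i] = brier(pred_a[idx], outcome[idx]) - brier(pred_b[idx], outcome[idx])
    lo, hi = np.percentile(gaps, [2.5, 97.5])
    return gaps.mean(), (lo, hi)

# Toy usage (hypothetical data): 200 held-out pairwise outcomes and two mechanisms'
# predicted win probabilities for the preferred side of each comparison.
outcome = rng.integers(0, 2, 200).astype(float)
pred_a = np.clip(outcome * 0.7 + rng.normal(0.15, 0.2, 200), 0, 1)
pred_b = np.clip(outcome * 0.6 + rng.normal(0.20, 0.2, 200), 0, 1)
gap, ci = paired_bootstrap_gap(pred_a, pred_b, outcome)
print(f"Brier gap A-B: {gap:.4f}, 95% CI ({ci[0]:.4f}, {ci[1]:.4f})")
```

If the 95% interval crosses zero, the gap is not statistically meaningful, which is when the Zipf-style reward bucketing above would kick in.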

Impact Evaluators.md

Lines changed: 2 additions & 8 deletions
@@ -63,14 +63,8 @@ It's hard to do [[Public Goods Funding]], open-source software, research, etc. t
 - Gather objective attestations about work (commits, usage stats, dependencies).
 - Apply multiple "evaluation lenses" to interpret the data.
 - Let funders choose which lenses align with their values.
-- When collecting data, [pairwise comparisons and rankings are more reliable than absolute scoring](https://anishathalye.com/designing-a-better-judging-system/).
-- Humans excel at relative judgments but struggle with absolute judgments.
-- [Many algorithms can be used to convert pairwise comparisons into absolute scores](https://crowd-kit.readthedocs.io/en/latest/) and [ranked lists](https://github.com/dustalov/evalica).
-- Pairwise shines when all the context is in the UX.
-- [Data is good at providing comprehensive coverage of things that are countable. Data is bad at dealing with nuances and qualitative concepts that experts intuitively understand.](https://gov.optimism.io/t/lessons-learned-from-two-years-of-retroactive-public-goods-funding/9239)
-- Crowds bring natural diversity and help capture human semantics. [Disagreement is signal, not just noise](https://github.com/CrowdTruth/CrowdTruth-core/blob/master/tutorial/Part%20I_%20CrowdTruth%20Tutorial.pdf). There are niches of experts in the crowds.
-- Collecting good pairwise data [is similar to collecting good ML/AI training data](https://github.com/cleanlab/cleanlab).
-- [The RLHF field also deals with this issue](https://mlhp.stanford.edu/src/chap6.html).
+- Prefer [[Pairwise Comparisons]] for human input over absolute scoring; standard methods turn pairs into scores/ranks and handle noisy raters.
+- [Data is good at providing comprehensive coverage of things that are countable. Data is bad at dealing with nuances and qualitative concepts that experts intuitively understand.](https://gov.optimism.io/t/lessons-learned-from-two-years-of-retroactive-public-goods-funding/9239)
 - **Design for composability**. Define clear data structures (graphs, weight vectors) as APIs between layers (a sketch follows this hunk).
 - Multiple communities could share measurement infrastructure.
 - Different evaluation methods can operate on the same data.
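A minimal sketch of the "clear data structures as APIs between layers" idea. The record and field names below are hypothetical, not an existing schema; the point is that every lens consumes the same comparison data and emits the same kind of weight vector.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Comparison:
    """One pairwise judgment: rater_id preferred project `winner` over project `loser`."""
    rater_id: str  # hypothetical identifiers, not an existing schema
    winner: str
    loser: str

@dataclass
class WeightVector:
    """Weights over projects; the artifact passed between measurement and funding layers."""
    weights: dict[str, float]

    def normalized(self) -> "WeightVector":
        total = sum(self.weights.values())
        return WeightVector({k: v / total for k, v in self.weights.items()})

# An evaluation "lens" is then just a function from shared data to weights, so multiple
# communities can reuse the measurement layer and swap evaluation methods freely.
Lens = Callable[[list[Comparison]], WeightVector]
```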

Pairwise Comparisons.md

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+# Pairwise Comparisons
+
+Pairwise comparison is the process of comparing entities in pairs to judge which of the two is preferred, has a greater amount of some quantitative property, or whether the two are identical. Pairwise comparisons are useful when aggregating human preferences.
+
+## Why Pairwise
+
+- Humans are better at relative judgments than absolute scoring. Scale-free comparisons reduce calibration headaches ([better judging systems](https://anishathalye.com/designing-a-better-judging-system/)).
+- Simple binary choices reduce cognitive load and make participation easier than intensity sliders.
+- Works across domains (psychometrics, recsys, sports, RLHF) where latent utility is hard to measure directly ([pairwise calibrated rewards](https://www.itai-shapira.com/pdfs/pairwise_calibrated_rewards_for_pluralistic_alignment.pdf)).
+- Disagreement is signal. Diversity in raters surfaces semantics a single expert might miss ([CrowdTruth](https://github.com/CrowdTruth/CrowdTruth-core/blob/master/tutorial/Part%20I_%20CrowdTruth%20Tutorial.pdf)).
+
+## Collecting Good Data
+
+- Keep the UX fast and low-friction. Suggest options, keep context in the UI, and let people expand details only if they want to.
+- Avoid intensity questions. They are order-dependent and [require global knowledge](https://xkcd.com/883/).
+- Use [active sampling](https://projecteuclid.org/journals/annals-of-statistics/volume-47/issue-6/Active-ranking-from-pairwise-comparisons-and-when-parametric-assumptions-do/10.1214/18-AOS1772.pdf)/dueling bandits to focus on informative pairs. Stop when the marginal value drops.
+- [Top-k tasks](https://proceedings.mlr.press/v84/heckel18a.html) can scale collection (pick the best 3 of 6) while still being convertible to pairwise data (see the sketch after this list).
+- Expect [noisy raters](https://arxiv.org/abs/1612.04413). Filter or reweight after the fact using heuristics or gold questions instead of overfitting to ["expert" biases](https://link.springer.com/article/10.1007/s10618-024-01024-z).
+
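A minimal sketch of the top-k expansion mentioned above, assuming a "pick the best k of n" answer implies that each picked item beats each unpicked item. The function and project names are hypothetical.

```python
from itertools import product

def topk_to_pairs(shown: list[str], picked: list[str]) -> list[tuple[str, str]]:
    """Expand one 'pick the best k of n' answer into implied (winner, loser) pairs.

    Picked items are unordered among themselves, so no pairs are emitted within the picked set.
    """
    unpicked = [item for item in shown if item not in picked]
    return list(product(picked, unpicked))

# One task (hypothetical project ids): show 6 projects, ask for the top 3 (unordered).
shown = ["proj-a", "proj-b", "proj-c", "proj-d", "proj-e", "proj-f"]
picked = ["proj-b", "proj-d", "proj-e"]
print(len(topk_to_pairs(shown, picked)))  # 3 picked x 3 unpicked = 9 implied comparisons
```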
+## Aggregation and Evaluation
+
+- There are many aggregation and evaluation rules: [Bradley-Terry](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model), [Huber loss in log-space](https://en.wikipedia.org/wiki/Huber_loss), [Brier score](https://en.wikipedia.org/wiki/Brier_score), ...
+- Converting pairs into scores or rankings is standard; start with Elo/Bradley-Terry (or crowd-aware variants) before custom models (see the sketch after this list).
+- Use robust methods (crowd BT, hierarchical BT, [Bayesian variants](https://erichorvitz.com/crowd_pairwise.pdf)) to correct for annotator bias and uncertainty.
+- Expert jurors can be inconsistent, biased, and expensive. [Large graphs of comparisons](https://arxiv.org/pdf/1505.01462) are needed to tame variance.
+- You can quantify uncertainty in accuracy/Brier by using the [bootstrap](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)).
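As a reference point for the Elo/Bradley-Terry bullet above, here is a minimal Bradley-Terry fit using the classic minorization-maximization update. It is a sketch, not a library API: it assumes every item has at least one win and one loss, and the toy data is made up.

```python
from collections import defaultdict

def bradley_terry(pairs, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via the MM update.

    Assumes every item has at least one win and one loss; otherwise the MLE degenerates.
    """
    items = sorted({x for pair in pairs for x in pair})
    wins = defaultdict(float)   # total wins per item
    games = defaultdict(float)  # number of comparisons per unordered pair
    for winner, loser in pairs:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
    strengths = {i: 1.0 for i in items}
    for _ in range(iters):
        updated = {}
        for i in items:
            denom = sum(
                games[frozenset((i, j))] / (strengths[i] + strengths[j])
                for j in items
                if j != i and frozenset((i, j)) in games
            )
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strengths = {i: v / total for i, v in updated.items()}  # normalize to sum to 1
    return strengths

# Toy usage: a > b (twice), b > c, c > a, a > c.
pairs = [("a", "b"), ("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
print(bradley_terry(pairs))
```

Crowd-aware or Bayesian variants add per-rater reliability or priors on top of this same likelihood.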
