Commit 8d4338f

📝 Refactor pairwise comparison

1 parent 3574691 commit 8d4338f

3 files changed (+29 / -30 lines)

Deep Funding.md

Lines changed: 1 addition & 22 deletions
@@ -94,31 +94,10 @@ Once the competition ends, extra comparisons could be gathered for projects that
 
 - Set a consensus over which meta-mechanism is used to evaluate weights (e.g., Brier score). Judge/rank mechanisms and models solely on their performance against the rigorous pre-built eval set. No subjective opinions, just a leaderboard of the most aligned weight distributions.
 - Win rates can be derived from pairwise comparisons.
-- No intensity, just more good old pairwise comparisons!
-- Intensity [requires global knowledge](https://xkcd.com/883/), has interpersonal scales, and humans are incoherent when assigning intensities (even within the same order of magnitude).
-- Make it easy and smooth for people to make their comparisons. Use LLM suggestions, good UX with details, remove any friction, and gather as many comparisons as possible. Filter after the fact using heuristics or something simpler like a whitelist. If there is a test set (labels from people the org trusts), evaluate against it to choose the best labelers.
-- Fields that use pairwise comparisons:
-  - Psychology (psychometrics), trying to predict the latent utilities of items
-  - Consumer science, doing preference testing
-  - Decision analysis
-  - Marketing (also with top-k or best-worst methods)
-  - Recommendation systems
-  - Sports (Elo)
-  - RLHF
-- Pairwise comparisons make choices a simple decision (yes/no, this or that). No one knows what "3.4x better" means.
-- Occam's razor works here too: simple things generalize better.
-- Intensity makes the distribution curve arbitrary.
-- We should test the assumption that expert jurors give good results. Jurors are messy and not well calibrated. Collecting more information from "expert" jurors will probably add more noise. We should instead assume noisy jurors and use techniques to deal with that.
-- There are better and more modern methods to derive weights from [noisy pairwise comparisons](https://arxiv.org/abs/2510.09333) ([from multiple annotators](https://arxiv.org/abs/1612.04413)).
-- [Detect and correct for evaluators' bias in the task of ranking items from pairwise comparisons](https://link.springer.com/article/10.1007/s10618-024-01024-z).
-- Use active ranking or dueling bandits to [speed up the data-gathering process](https://projecteuclid.org/journals/annals-of-statistics/volume-47/issue-6/Active-ranking-from-pairwise-comparisons-and-when-parametric-assumptions-do/10.1214/18-AOS1772.pdf).
-- Stop with a "budget stability" rule (the expected absolute dollar change from one more batch of comparisons is less than a threshold).
+- Lean on the [[Pairwise Comparisons]] playbook (binary questions over intensity, active sampling, filtering noisy raters) for any human labeling.
 - Do some post-processing to the weights:
   - Report accuracy/Brier and use a paired bootstrap to see if the gap is statistically meaningful (a sketch follows this hunk).
   - If gaps are not statistically meaningful, bucket rewards (using Zipf's law) so it feels fair.
-- If anyone can rate (or jury selection is more relaxed), you can remove low-quality raters with heuristics or keep only the best N raters (crowd BT).
-- Crowdsourced annotators are often unreliable; [integrating multiple noisy labels to produce accurate annotations is arguably the most important consideration when designing and implementing a crowdsourcing system](https://arxiv.org/pdf/2407.06902).
-- To gather more comparisons, a top-k method could be used instead of pairwise: show 6 projects and ask for the top 3 (no need to order them).
 - How would things look if we used [Bayesian Bradley-Terry](https://erichorvitz.com/crowd_pairwise.pdf) instead of [classic Bradley-Terry](https://gwern.net/resorter)? Since comparisons are noisy and jurors are unreliable, can we [compute distributions instead of "skills"](https://github.com/max-niederman/fullrank)?
 - Instead of one canonical graph, allow different stakeholder groups (developers, funders, users) to maintain their own weight overlays on the same edge structure. Aggregate these views using quadratic or other mechanisms.
 - If there is a plurality of these "dependency graphs" (or just different sets of weights), the funding organization can choose which one to use! The curators gain a % of the money for their service. This creates a market-like mechanism that incentivizes useful curation.
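The retained "report accuracy/Brier and use a paired bootstrap" step can be made concrete. The sketch below is a minimal illustration on toy data; the mechanism predictions, sample size, and names are hypothetical, not code from this repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def brier(pred, outcome):
    # Mean squared error between predicted win probabilities and observed 0/1 outcomes.
    return np.mean((pred - outcome) ** 2)

def paired_bootstrap_gap(pred_a, pred_b, outcome, n_boot=10_000):
    """Bootstrap the Brier-score gap between two weight mechanisms on the same eval set."""
    n = len(outcome)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)  # resample the same comparisons for both mechanisms
        gaps[i] = brier(pred_a[idx], outcome[idx]) - brier(pred_b[idx], outcome[idx])
    lo, hi = np.percentile(gaps, [2.5, 97.5])
    return gaps.mean(), (lo, hi)

# Toy usage (hypothetical data): 200 held-out pairwise outcomes and two mechanisms'
# predicted win probabilities for the preferred side of each comparison.
outcome = rng.integers(0, 2, 200).astype(float)
pred_a = np.clip(outcome * 0.7 + rng.normal(0.15, 0.2, 200), 0, 1)
pred_b = np.clip(outcome * 0.6 + rng.normal(0.20, 0.2, 200), 0, 1)
gap, ci = paired_bootstrap_gap(pred_a, pred_b, outcome)
print(f"Brier gap A-B: {gap:.4f}, 95% CI ({ci[0]:.4f}, {ci[1]:.4f})")
```

If the 95% interval crosses zero, the gap is not statistically meaningful, which is when the Zipf-style reward bucketing above would kick in.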

Impact Evaluators.md

Lines changed: 2 additions & 8 deletions
@@ -63,14 +63,8 @@ It's hard to do [[Public Goods Funding]], open-source software, research, etc. t
 - Gather objective attestations about work (commits, usage stats, dependencies).
 - Apply multiple "evaluation lenses" to interpret the data.
 - Let funders choose which lenses align with their values.
-- When collecting data, [pairwise comparisons and rankings are more reliable than absolute scoring](https://anishathalye.com/designing-a-better-judging-system/).
-- Humans excel at relative judgments but struggle with absolute judgments.
-- [Many algorithms can be used to convert pairwise comparisons into absolute scores](https://crowd-kit.readthedocs.io/en/latest/) and [ranked lists](https://github.com/dustalov/evalica).
-- Pairwise shines when all the context is in the UX.
-- [Data is good at providing comprehensive coverage of things that are countable. Data is bad at dealing with nuances and qualitative concepts that experts intuitively understand.](https://gov.optimism.io/t/lessons-learned-from-two-years-of-retroactive-public-goods-funding/9239)
-- Crowds bring natural diversity and help capture human semantics. [Disagreement is signal, not just noise](https://github.com/CrowdTruth/CrowdTruth-core/blob/master/tutorial/Part%20I_%20CrowdTruth%20Tutorial.pdf). There are niches of experts in the crowds.
-- Collecting good pairwise data [is similar to collecting good ML/AI training data](https://github.com/cleanlab/cleanlab).
-- [The RLHF field also deals with this issue](https://mlhp.stanford.edu/src/chap6.html).
+- Prefer [[Pairwise Comparisons]] for human input over absolute scoring; standard methods turn pairs into scores/ranks and handle noisy raters.
+- [Data is good at providing comprehensive coverage of things that are countable. Data is bad at dealing with nuances and qualitative concepts that experts intuitively understand.](https://gov.optimism.io/t/lessons-learned-from-two-years-of-retroactive-public-goods-funding/9239)
 - **Design for composability**. Define clear data structures (graphs, weight vectors) as APIs between layers (a sketch follows this hunk).
 - Multiple communities could share measurement infrastructure.
 - Different evaluation methods can operate on the same data.
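A minimal sketch of the "clear data structures as APIs between layers" idea. The record and field names below are hypothetical, not an existing schema; the point is that every lens consumes the same comparison data and emits the same kind of weight vector.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Comparison:
    """One pairwise judgment: rater_id preferred project `winner` over project `loser`."""
    rater_id: str  # hypothetical identifiers, not an existing schema
    winner: str
    loser: str

@dataclass
class WeightVector:
    """Weights over projects; the artifact passed between measurement and funding layers."""
    weights: dict[str, float]

    def normalized(self) -> "WeightVector":
        total = sum(self.weights.values())
        return WeightVector({k: v / total for k, v in self.weights.items()})

# An evaluation "lens" is then just a function from shared data to weights, so multiple
# communities can reuse the measurement layer and swap evaluation methods freely.
Lens = Callable[[list[Comparison]], WeightVector]
```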

Pairwise Comparisons.md

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+# Pairwise Comparisons
+
+Pairwise comparison is the process of comparing entities in pairs to judge which of the two is preferred, has a greater amount of some quantitative property, or whether the two are identical. Pairwise comparisons are useful when aggregating human preferences.
+
+## Why Pairwise
+
+- Humans are better at relative judgments than absolute scoring. Scale-free comparisons reduce calibration headaches ([better judging systems](https://anishathalye.com/designing-a-better-judging-system/)).
+- Simple binary choices reduce cognitive load and make participation easier than intensity sliders.
+- Works across domains (psychometrics, recsys, sports, RLHF) where latent utility is hard to measure directly ([pairwise calibrated rewards](https://www.itai-shapira.com/pdfs/pairwise_calibrated_rewards_for_pluralistic_alignment.pdf)).
+- Disagreement is signal. Diversity in raters surfaces semantics a single expert might miss ([CrowdTruth](https://github.com/CrowdTruth/CrowdTruth-core/blob/master/tutorial/Part%20I_%20CrowdTruth%20Tutorial.pdf)).
+
+## Collecting Good Data
+
+- Keep the UX fast and low-friction. Suggest options, keep context in the UI, and let people expand details only if they want to.
+- Avoid intensity questions. They are order-dependent and [require global knowledge](https://xkcd.com/883/).
+- Use [active sampling](https://projecteuclid.org/journals/annals-of-statistics/volume-47/issue-6/Active-ranking-from-pairwise-comparisons-and-when-parametric-assumptions-do/10.1214/18-AOS1772.pdf)/dueling bandits to focus on informative pairs. Stop when the marginal value drops.
+- [Top-k tasks](https://proceedings.mlr.press/v84/heckel18a.html) can scale collection (pick the best 3 of 6) while still being convertible to pairwise data (see the sketch after this list).
+- Expect [noisy raters](https://arxiv.org/abs/1612.04413). Filter or reweight after the fact using heuristics or gold questions instead of overfitting to ["expert" biases](https://link.springer.com/article/10.1007/s10618-024-01024-z).
+
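A minimal sketch of the top-k expansion mentioned above, assuming a "pick the best k of n" answer implies that each picked item beats each unpicked item. The function and project names are hypothetical.

```python
from itertools import product

def topk_to_pairs(shown: list[str], picked: list[str]) -> list[tuple[str, str]]:
    """Expand one 'pick the best k of n' answer into implied (winner, loser) pairs.

    Picked items are unordered among themselves, so no pairs are emitted within the picked set.
    """
    unpicked = [item for item in shown if item not in picked]
    return list(product(picked, unpicked))

# One task (hypothetical project ids): show 6 projects, ask for the top 3 (unordered).
shown = ["proj-a", "proj-b", "proj-c", "proj-d", "proj-e", "proj-f"]
picked = ["proj-b", "proj-d", "proj-e"]
print(len(topk_to_pairs(shown, picked)))  # 3 picked x 3 unpicked = 9 implied comparisons
```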
+## Aggregation and Evaluation
+
+- There are many aggregation and evaluation rules: [Bradley-Terry](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model), [Huber loss in log-space](https://en.wikipedia.org/wiki/Huber_loss), [Brier score](https://en.wikipedia.org/wiki/Brier_score), ...
+- Converting pairs into scores or rankings is standard; start with Elo/Bradley-Terry (or crowd-aware variants) before custom models (see the sketch after this list).
+- Use robust methods (crowd BT, hierarchical BT, [Bayesian variants](https://erichorvitz.com/crowd_pairwise.pdf)) to correct for annotator bias and uncertainty.
+- Expert jurors can be inconsistent, biased, and expensive. [Large graphs of comparisons](https://arxiv.org/pdf/1505.01462) are needed to tame variance.
+- You can quantify uncertainty in accuracy/Brier by using the [bootstrap](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)).
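As a reference point for the Elo/Bradley-Terry bullet above, here is a minimal Bradley-Terry fit using the classic minorization-maximization update. It is a sketch, not a library API: it assumes every item has at least one win and one loss, and the toy data is made up.

```python
from collections import defaultdict

def bradley_terry(pairs, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via the MM update.

    Assumes every item has at least one win and one loss; otherwise the MLE degenerates.
    """
    items = sorted({x for pair in pairs for x in pair})
    wins = defaultdict(float)   # total wins per item
    games = defaultdict(float)  # number of comparisons per unordered pair
    for winner, loser in pairs:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
    strengths = {i: 1.0 for i in items}
    for _ in range(iters):
        updated = {}
        for i in items:
            denom = sum(
                games[frozenset((i, j))] / (strengths[i] + strengths[j])
                for j in items
                if j != i and frozenset((i, j)) in games
            )
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strengths = {i: v / total for i, v in updated.items()}  # normalize to sum to 1
    return strengths

# Toy usage: a > b (twice), b > c, c > a, a > c.
pairs = [("a", "b"), ("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
print(bradley_terry(pairs))
```

Crowd-aware or Bayesian variants add per-rater reliability or priors on top of this same likelihood.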
