
Commit 2306b0c

📝 Feedback considerations in Public Goods Funding
1 parent 87b3440 commit 2306b0c

3 files changed, +38 -8 lines changed

Deep Funding.md

Lines changed: 1 addition & 1 deletion
@@ -80,7 +80,7 @@ Like in the current setup, a DAG of projects is needed. The organizers publish t

Once participants have worked on their models and sent/traded their predictions, the "evaluated project" list is revealed and only those projects are used to evaluate the predicted weights. The best strategy is to price all items truthfully. The question here is: how can we evaluate only a few projects without jurors giving a connected graph to the rest of the projects?

83- Since we don't have a global view (no interconnected graph), we need to use comparative and scale-free metrics. Metrics like the [Brier score](https://en.wikipedia.org/wiki/Brier_score) or methods like [Bradley Terry](https://www.itai-shapira.com/pdfs/pairwise_calibrated_rewards_for_pluralistic_alignment.pdf) can be used to evaluate any model or trader's weights ([in that case you're fitting just a single global scale or temperature parameter to minimize negative log-likelihood](https://apxml.com/courses/rlhf-reinforcement-learning-human-feedback/chapter-3-reward-modeling-human-preferences/reward-model-calibration))!
83+ Since we don't have a global view (no interconnected graph), we need to use comparative and scale-free metrics. Metrics like the [Brier score](https://en.wikipedia.org/wiki/Brier_score) or methods like [Bradley Terry](https://www.itai-shapira.com/pdfs/pairwise_calibrated_rewards_for_pluralistic_alignment.pdf) can be used to evaluate any model or trader's weights ([in that case you're fitting just a single global scale or temperature parameter to minimize negative log-likelihood](https://apxml.com/courses/rlhf-reinforcement-learning-human-feedback/chapter-3-reward-modeling-human-preferences/reward-model-calibration))! Don't give the mechanism explicit instructions. Just give it a goal (and compute rewards) and let it figure out the best strategy.

Once the best model is chosen (the one that acts closest to the chosen subset of pairwise comparisons), the same pairwise comparisons can be used [to adjust the scale of the weight distribution](https://proceedings.mlr.press/v70/guo17a/guo17a.pdf). That means the market resolution uses only the subset (for payouts to traders), but the funding distribution uses the model's global ranking, with its probabilities calibrated to the subset via a single scalar _a_ that pins the entire slate to the same scale that was verified by real judgments. The **jurors'** pairwise comparisons can even be "merged" with the best model to incorporate all of that data.
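A minimal sketch of that calibration step, assuming the chosen model outputs a raw score per project and the jurors' judgments on the evaluated subset arrive as binary pairwise preferences; the function names and the toy numbers are illustrative, not part of any existing implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def fit_scale(scores, pairs, prefs):
    """Fit a single scalar `a` (Bradley-Terry style) so that sigmoid(a * (s_i - s_j))
    matches the jurors' pairwise preferences on the evaluated subset, by minimizing
    the negative log-likelihood."""
    s = np.asarray(scores, dtype=float)
    diffs = np.array([s[i] - s[j] for i, j in pairs])
    y = np.asarray(prefs, dtype=float)  # 1.0 means jurors preferred i over j

    def nll(a):
        p = 1.0 / (1.0 + np.exp(-a * diffs))
        eps = 1e-12
        return -np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

    return minimize_scalar(nll, bounds=(1e-3, 100.0), method="bounded").x


def funding_weights(scores, a):
    """Global funding distribution: a softmax of the model's scores at the fitted scale."""
    z = a * (np.asarray(scores, dtype=float) - np.max(scores))
    w = np.exp(z)
    return w / w.sum()


# Toy example: scores for five projects, juror comparisons only on the subset {0, 1, 2}.
scores = [2.1, 1.4, 0.3, 1.0, -0.5]
pairs = [(0, 1), (0, 2), (1, 2)]
prefs = [1, 1, 0]  # jurors mostly agree with the model but flip one pair
a = fit_scale(scores, pairs, prefs)
print(a, funding_weights(scores, a))
```

The subset is still what resolves the market for trader payouts; the fitted scale only rescales the model's global ranking so the whole slate sits on the scale the jurors actually verified.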

Machine Learning.md

Lines changed: 25 additions & 4 deletions
@@ -4,7 +4,7 @@

1. Frame the problem. Define a clear and concise objective with clear metrics. [Write it as a design doc](https://applyingml.com/resources/ml-design-docs/). To know what "good enough" means, you have to collect and annotate more data than most people and organizations want to do.
1. Get the data. Make the data tidy. Machine learning models are only as reliable as the data used to train them. [The data matters more than the model](https://twitter.com/beeonaposy/status/1353735905962577920). [The main bottleneck is collecting enough high quality data and getting it properly annotated and verified](https://news.ycombinator.com/item?id=45875618). Then do proper evals with humans in the loop to get it right.
7- 1. Explore the data. Verify any assumptions. Garbage in, garbage out.
7+ 1. Explore the data. Verify any assumptions. Garbage in, garbage out. Remove ALL friction from looking at data.
1. Create a model. [Start with the simplest model!](https://developers.google.com/machine-learning/guides/rules-of-ml/) That will be the [baseline model](https://blog.insightdatascience.com/always-start-with-a-stupid-model-no-exceptions-3a22314b9aaa). Evaluate the model with the defined metric (see the sketch after this list).
1. Make sure everything works end to end. _You design it, you train it, you deploy it_. [Deploy the model quickly](https://nlathia.github.io/2019/08/Machine-learning-faster.html) and automatically. Add a clear description of the model. [Monitor model performance in production](https://youtu.be/hqxQO7MoQIE).
1. Make results (models, analysis, graphs, ...) reproducible (code, environment and data). Version your code, data and configuration. Make feature dependencies explicit in the code. Separate code from configuration.
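A minimal sketch of the "simplest model first" step referenced in the list above, assuming a scikit-learn setup; the dataset and metric are placeholders for your own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data; swap in your own features, labels, and the metric you defined.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline f1:", f1_score(y_test, baseline.predict(X_test)))

# Any "real" model has to beat the baseline on the same metric to justify its complexity.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
print("model f1:   ", f1_score(y_test, model.predict(X_test)))
```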
@@ -16,10 +16,31 @@

These points are expanded with more details in courses like [Made With ML](https://madewithml.com/).

19- ## Tips
19+ ## Evals

21- - Use pre-commit hooks. Start with the basics — black, isort — then add pydocstyle, mypy, check-ast, ...
22- - Version your data! Don't overwrite raw datasets.
21+ > Don't hope for "great", specify it, measure it, and improve toward it!

23+ - Evals make fuzzy goals and abstract ideas specific and explicit. They help you systematically measure and improve a system.
24+ - Evals are a key set of tools and methods to measure and improve the ability of an AI system to meet expectations.
25+ - [Success with AI hinges on how fast you can iterate](https://hamel.dev/blog/posts/evals/#iterating-quickly-success). You must have processes and tools for evaluating quality (tests), debugging issues (logging, inspecting data), and changing the behavior of the system (prompt engineering, fine-tuning, writing code).
26+ - Collecting good evals will make you understand the problem better.
27+ - Working with probabilistic systems requires new kinds of measurement and deeper consideration of trade-offs.
28+ - Don't start building if you cannot define what "great" means for your use case.

30+ ### The [Eval Loop](https://openai.com/index/evals-drive-next-chapter-of-ai/)

32+ 1. **Specify**.
33+ - Define what "great" means.
34+ - Write down the purpose of your AI system in plain terms.
35+ - The resulting golden set of examples should be a living, authoritative reference of your most skilled experts' judgement and taste for what "great" looks like.
36+ - The process is iterative and messy.
37+ 2. **Measure**.
38+ - Test against real-world conditions. Reliably surface concrete examples of how and when the system is failing.
39+ - Use examples drawn from real-world situations whenever possible.
40+ 3. **Improve**.
41+ - Learn from errors.
42+ - Addressing problems uncovered by your eval can take on many forms: refining prompts, adjusting data access, updating the eval itself to better reflect your goals, ...
43+ -
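A minimal sketch of the specify → measure → improve loop described above, assuming your system can be called as a plain function and that a small, expert-curated golden set exists; every name here is illustrative.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenExample:
    """One expert-labelled example of what "great" looks like (Specify)."""
    prompt: str
    expected: str


def run_eval(system: Callable[[str], str],
             golden_set: list[GoldenExample],
             grade: Callable[[str, str], bool]) -> list[tuple[GoldenExample, str]]:
    """Measure: run the system on the golden set and surface concrete failures."""
    failures = []
    for example in golden_set:
        output = system(example.prompt)
        if not grade(output, example.expected):
            failures.append((example, output))
    passed = len(golden_set) - len(failures)
    print(f"pass rate: {passed}/{len(golden_set)}")
    return failures


# Improve: inspect failures, then refine prompts, data access, or the eval itself.
golden = [GoldenExample("2 + 2", "4"), GoldenExample("capital of France", "Paris")]
failures = run_eval(
    system=lambda q: "4" if "2 + 2" in q else "Lyon",  # stand-in for your AI system
    golden_set=golden,
    grade=lambda out, exp: out.strip().lower() == exp.strip().lower(),
)
for example, output in failures:
    print(f"FAIL: {example.prompt!r} -> {output!r} (expected {example.expected!r})")
```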

### ML In Production Resources

Public Goods Funding.md

Lines changed: 12 additions & 3 deletions
@@ -7,6 +7,9 @@ Public goods are defined as goods that are both non-excludable (it's infeasible
- [[Plurality|Different public goods require different funding approaches based on their characteristics and communities]].
- Mathematical optimality matters less than perceived fairness and historical precedent. Ideal funding methods that don't work in practice are not ideal.
- Many [[Mechanism Design|mechanisms]] which satisfy different constraints have already been discovered, and it seems unlikely that a different approach will radically change the landscape. Instead, the **bottleneck seems to be in popularizing and scaling existing mechanisms**.
10+ - Retrospective evaluation is often easier than prospective funding. [[Impact Evaluators]] and retroactive public goods funding reward **verifiable impact after the fact** instead of just predictions about future impact.
11+ - Effective funding systems usually start small and local, with tight feedback loops and clear community ownership, and only then generalize once patterns are proven.
12+ - The funding **infrastructure itself is a public good**. Data, evaluation pipelines, and mechanisms should be open, composable, and forkable so communities can reuse and adapt them.

## Desirable Criteria

@@ -18,8 +21,14 @@ Public goods are defined as goods that are both non-excludable (it's infeasible
- **Provable Participation**. Even if spending should be kept private, users may want to prove their participation in a funding mechanism in order to boost their reputation or as part of an agreement.
- **Identity and Reputation**. To prevent sybil attacks, some form of identity is needed. If reputation is important, a public identity is preferred. If anonymity is required, zero-knowledge proofs or re-randomizable encryption may be necessary. Reputation is an important incentive to fund public goods. Some form of reputation score or record of participation can be useful for repeated games. These scores can help identify bad actors or help communities coalesce around a particular funding venue. [Identity-free mechanisms can also be used](https://victorsintnicolaas.com/funding-public-goods-in-identity-free-systems/).
- **Verifiable Mechanisms**. Users may want certain guarantees about a mechanism before or after participation, especially if the mechanism being used is concealed. Ex-ante, they may want to upper-bound their amount of spending towards the good; ex-post, they may require proof that a sufficient number of individuals contributed.
21- - **Anti-Collusion Infrastructure**. Like secure voting [[Systems]], there is a threat of buying votes in a funding mechanism. Collusion can be discouraged by making it impossible for users to prove how they reported their preferences. This infrastructure must be extended to prevent collusion between the 3rd party and the users.
24+ - **Anti-Collusion Infrastructure**. Like secure voting systems, there is a threat of buying votes in a funding mechanism. Collusion can be discouraged by making it impossible for users to prove how they reported their preferences. This infrastructure must be extended to prevent collusion between the 3rd party and the users.
- **Predictable Schedules**. Participants need to know when they are getting funded.
26+ - **Simplicity and Legibility**. The simpler a mechanism (fewer parameters, clear rules, open-source and publicly verifiable execution), the less space there is for hidden privilege, corruption, and overfitting, and the easier it is for people to understand and engage with it.
27+ - **Anti-Goodhart Resilience**. Any metric used for decisions will be gamed. Mechanisms should assume this, incorporate **feedback loops and error analysis**, and make it easy to update or combine metrics and evaluators when they drift from what really matters.
28+ - **Plurality and Forkability**. No single mechanism can satisfy all desirable properties in all contexts. Systems should support **multiple evaluators and preference-aggregation methods**, and allow communities to fork and adapt criteria when they disagree.
29+ - **Composable Data and Evaluation Layers**. Separate **data collection** (attestations about work, usage, dependencies, etc.) from **judgment** (how that data is weighted). Multiple evaluation "lenses" (models, juries, dashboards) should be able to operate on the same shared data structures (graphs, weight vectors); see the sketch after this list.
30+ - **Exploration vs Exploitation**. Funding mechanisms are optimization processes that tend to exploit known winners. Some budget should be reserved for **exploration of uncertain, high-variance public goods**, not just those that already score well on existing metrics.
31+ - **Community Feedback and Local Control**. Mechanisms should include channels for participants to flag problems, suggest changes, and adjust evaluation criteria. Small, local experiments with clear consent and ownership are often the safest way to evolve funding systems.
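A minimal sketch of the composable data and evaluation layers idea above: shared attestation records that several independent evaluation lenses turn into weight vectors. The record fields and the two lenses are made up for illustration, not taken from any existing system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attestation:
    """Shared data layer: a claim about a project, independent of any judgment."""
    project: str
    kind: str      # e.g. "downloads", "dependents", "juror_score"
    value: float


def usage_lens(attestations):
    """One evaluation lens: weight projects by raw usage signals."""
    totals = defaultdict(float)
    for a in attestations:
        if a.kind in ("downloads", "dependents"):
            totals[a.project] += a.value
    norm = sum(totals.values()) or 1.0
    return {p: v / norm for p, v in totals.items()}


def jury_lens(attestations):
    """Another lens over the same data: only trust juror scores."""
    totals = defaultdict(float)
    for a in attestations:
        if a.kind == "juror_score":
            totals[a.project] += a.value
    norm = sum(totals.values()) or 1.0
    return {p: v / norm for p, v in totals.items()}


data = [
    Attestation("lib-a", "downloads", 9000),
    Attestation("lib-b", "downloads", 1000),
    Attestation("lib-a", "juror_score", 2),
    Attestation("lib-b", "juror_score", 8),
]
print(usage_lens(data))  # {'lib-a': 0.9, 'lib-b': 0.1}
print(jury_lens(data))   # {'lib-a': 0.2, 'lib-b': 0.8}
```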

## Methods

@@ -37,9 +46,9 @@ The [S-Process (Simulation Process)](https://www.youtube.com/watch?v=jWivz6KidkI

When a project turns out to be great, pay both the builders and the early funders. People who repeatedly back the right projects end up with more money and can fund more next time.

40- Builders create public goods (OSS, research, infrastructure, etc.), funders chooses and puts money in them. Retrospective rounds are made with any [[Impact Evaluators]] mechanism. Projects and funders are rewarded accordingly.
49+ Builders create public goods (OSS, research, infrastructure, etc.), funders choose and put money in them. Retrospective rounds are made with any [[Impact Evaluators]] mechanism. Projects and funders are rewarded accordingly.

42- Each cycle has tro phases: Funding and Retro. The funding phase is where funders give money to any projects they decide to back. They recive "retro shares" for that project. Basically, Who backed what, and by how much? The retrospective phase is where the system rewards impact. A mechanism is run (e.g: [[Deep Funding]]) that returns a version of impact for each project.
51+ Each cycle has two phases: Funding and Retro. The funding phase is where funders give money to any projects they decide to back. They receive "retro shares" for that project. Basically, who backed what, and by how much? The retrospective phase is where the system rewards impact. A mechanism is run (e.g: [[Deep Funding]]) that returns a version of impact for each project.

Capital concentrates in the hands of those who were repeatedly "right" about which public goods mattered.
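A minimal sketch of one funding/retro cycle as described above, assuming impact scores come back from some [[Impact Evaluators]] run and that retro shares are simply proportional to what each funder put in; the builder/funder split is an arbitrary illustrative parameter.

```python
from collections import defaultdict


def retro_cycle(contributions, impact, retro_pool, funder_share=0.3):
    """contributions: {project: {funder: amount}} from the funding phase.
    impact: {project: impact score} returned by the retro mechanism.
    Returns payouts to builders and to funders (via their retro shares)."""
    builder_payouts = {}
    funder_payouts = defaultdict(float)
    total_impact = sum(impact.values()) or 1.0
    for project, score in impact.items():
        project_pool = retro_pool * score / total_impact
        builder_payouts[project] = project_pool * (1.0 - funder_share)
        backers = contributions.get(project, {})
        backed = sum(backers.values()) or 1.0
        for funder, amount in backers.items():
            # Retro shares: who backed what, and by how much.
            funder_payouts[funder] += project_pool * funder_share * amount / backed
    return builder_payouts, dict(funder_payouts)


# Toy cycle: two funders back two projects; the retro mechanism scores their impact.
contributions = {"lib-x": {"alice": 80, "bob": 20}, "app-y": {"bob": 50}}
impact = {"lib-x": 0.75, "app-y": 0.25}
builders, funders = retro_cycle(contributions, impact, retro_pool=1000)
print(builders)  # {'lib-x': 525.0, 'app-y': 175.0}
print(funders)   # {'alice': 180.0, 'bob': 120.0}
```

Funders who repeatedly back high-impact projects accumulate more to deploy in the next cycle, which is exactly the concentration effect noted above.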
