
Commit 7795b5e

πŸ“ Expand prompt and eval guidance
1 parent 128a582 commit 7795b5e

File tree

2 files changed: +8 −0 lines changed


β€ŽArtificial Intelligence Models.mdβ€Ž

Lines changed: 3 additions & 0 deletions
@@ -56,6 +56,7 @@
 - Using LLMs for coding is difficult and unintuitive, requiring significant effort to master.
 - English is becoming the hottest new programming language. [Use it](https://addyo.substack.com/p/the-70-problem-hard-truths-about).
+- [Prompts are code](https://mariozechner.at/posts/2025-06-02-prompts-are-code/). Markdown and JSON files are state.
 - Use comments to guide the model to do what you want.
 - Don't delegate thinking, delegate work.
 - Describe the problem very clearly and effectively.
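The "prompts are code" idea above can be sketched minimally: keep prompts in version-controlled Markdown templates and render them with parameters, rather than typing them ad hoc. The file name and `$placeholder` syntax below are illustrative assumptions, not part of the cited post.

```python
import tempfile
from pathlib import Path
from string import Template

# Hypothetical sketch: a prompt lives in a checked-in Markdown file and is
# rendered like code. Template syntax ($placeholders) is an assumption.
def load_prompt(path: Path, **params: str) -> str:
    """Read a Markdown prompt template and fill in its $placeholders."""
    return Template(path.read_text(encoding="utf-8")).substitute(params)

# Simulate a checked-in prompt file (in a real repo it would live in git).
with tempfile.TemporaryDirectory() as tmp:
    prompt_file = Path(tmp) / "fix_bug.md"
    prompt_file.write_text(
        "Fix the bug in `$module`.\n"
        "- Do not change the public API.\n"
        "- Add a regression test.\n",
        encoding="utf-8",
    )
    prompt = load_prompt(prompt_file, module="parser.py")

print(prompt.splitlines()[0])  # Fix the bug in `parser.py`.
```

Because the template is a file, it can be reviewed, diffed, and versioned like any other source artifact.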
@@ -66,6 +67,8 @@
 - Provide the desired function signatures, API, or docs. Apply the TDD loop and make the model write tests and the code until the tests pass.
 - Prioritize exploration over execution (at first). Iterate towards precision during the brainstorming phase. Start fresh when switching to execution.
 - Many LLMs now have very large context windows, but filling them with irrelevant code or conversation can confuse the model. Above about 25k tokens of context, most models become distracted and less likely to conform to their system prompt.
+- [Use progressive disclosure](https://www.humanlayer.dev/blog/writing-a-good-claude-md) so the agent only sees task- or project-specific instructions when it needs them.
+- Prefer pointers to files over copies (no code snippets, ...).
 - Make the model ask you more questions to refine the ideas.
 - Take advantage of the fact that [redoing work is extremely cheap](https://crawshaw.io/blog/programming-with-llms).
 - If you want to force some "reasoning", ask something like "[is that a good suggestion?](https://news.ycombinator.com/item?id=42894688)" or "propose a variety of suggestions for the problem at hand and their trade-offs".
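The progressive-disclosure and context-budget points above can be sketched together: keep task-specific guidance in separate files, attach only the ones matching the current task, and guard against blowing past the ~25k-token range where models get distracted. The file names, keyword map, and chars-per-token estimate are assumptions for illustration.

```python
# Hypothetical sketch of progressive disclosure: instead of pasting every
# project instruction into the system prompt, keep task-specific guidance in
# separate files and attach only the relevant ones. All names below are
# illustrative assumptions.
GUIDES = {  # task keyword -> pointer to an instruction file
    "test": "docs/testing.md",
    "deploy": "docs/deploy.md",
    "migration": "docs/db-migrations.md",
}

def select_guides(task: str) -> list[str]:
    """Return only the instruction files relevant to this task."""
    return [path for kw, path in GUIDES.items() if kw in task.lower()]

def within_budget(context: str, max_tokens: int = 25_000) -> bool:
    """Crude token estimate: roughly 4 characters per token."""
    return len(context) / 4 <= max_tokens

print(select_guides("Write a migration and test it"))
# ['docs/testing.md', 'docs/db-migrations.md']
```

Note that this also implements "pointers, not copies": the model receives file paths to read on demand, not the full contents of every guide.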

β€ŽMachine Learning.mdβ€Ž

Lines changed: 5 additions & 0 deletions
@@ -26,6 +26,11 @@ These points are expanded with more details in courses like [Made With ML](https
 - Collecting good evals will make you understand the problem better.
 - Working with probabilistic systems requires new kinds of measurement and deeper consideration of trade-offs.
 - Don't start working if you cannot define what "great" means for your use case.
+- [Evals replace LGTM-vibes development](https://newsletter.pragmaticengineer.com/p/evals). They systematize quality when outputs are non-deterministic.
+- [Error analysis](https://youtu.be/ORrStCArmP4) workflow: build a simple trace viewer, review ~100 traces, annotate the first upstream failure ([open coding](https://shribe.eu/open-coding/)), cluster into themes ([axial coding](https://delvetool.com/blog/openaxialselective)), and use counts to prioritize. Bootstrap with grounded synthetic data if real data is thin.
+- Pick the right evaluator: code-based assertions for deterministic failures; LLM-as-judge for subjective ones. Keep labels binary (PASS/FAIL) with human critiques. Partition data so the judge cannot memorize answers; validate the judge against human labels (TPR/TNR) before trusting it.
+- Run evals in CI/CD and keep monitoring with production data.
+- [Analyze → measure → improve → automate → repeat](https://newsletter.pragmaticengineer.com/p/evals).
 - Good eval metrics:
   - Measure an error you've observed.
   - Relate to a non-trivial issue you will iterate on.
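The judge-validation step in the list above (check TPR/TNR against human labels before trusting an LLM judge) can be sketched as a small agreement check. The labels here are illustrative, not real eval data.

```python
# Hypothetical sketch: validate an LLM-as-judge against human PASS/FAIL labels
# before trusting it. TPR = fraction of human-PASS items the judge also passes;
# TNR = fraction of human-FAIL items the judge also fails.
def judge_agreement(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Return (TPR, TNR) of the judge relative to human labels (True = PASS)."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum(not h and not j for h, j in zip(human, judge))
    pos = sum(human)
    neg = len(human) - pos
    return tp / pos, tn / neg

# Illustrative labels: 5 traces reviewed by a human, then scored by the judge.
human = [True, True, True, False, False]
judge = [True, True, False, False, True]
tpr, tnr = judge_agreement(human, judge)
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")  # TPR=0.67 TNR=0.50
```

If either rate is low, fix the judge prompt (or the rubric) before using the judge at scale; binary labels keep this comparison trivial to compute.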
