This is a curated reading list for the problems Happy Paths targets:
- repeated wrong-turns in agentic coding loops
- cross-session / cross-developer reuse (memory)
- correctness vs cost/time ("thinking budget") tradeoffs
- realistic evaluation of software engineering agents
- developer productivity measurement

Happy Paths is not trying to be a research project, but we do want our claims (and our measurement choices) to be legible to people who have followed this literature.
- SWE-bench (2023): real GitHub issues as a benchmark.
  - Paper: https://arxiv.org/abs/2310.06770
  - Website/viewer: https://www.swebench.com/
- SWE-bench Verified (2024): tighter verification harness + public results.
- SWE-agent (NeurIPS 2024): agent-computer interfaces + guardrails for SWE-bench-style tasks.
- Agentless: an “agentless” approach to solving software development problems.
- SWE-SEARCH (ICLR 2025): Monte Carlo Tree Search plus iterative refinement to improve SWE agent performance.
  - Paper (ICLR proceedings PDF): https://proceedings.iclr.cc/paper_files/paper/2025/file/a1e6783e4d739196cad3336f12d402bf-Paper-Conference.pdf
- SWE-Effi (2025): re-evaluates SWE agent effectiveness under resource constraints.
- RepoBench (ICLR 2024): repository-level code completion.
- AgentBench (ICLR 2024): broader benchmark for LLMs as agents (not coding-only).
- ABC-Bench (2026): agentic backend coding tasks.
- ProjDevBench (2026): end-to-end project development benchmark for coding agents.
- ACE-Bench (2026): agentic coding in end-to-end development of complex features.
  - OpenReview: https://openreview.net/forum?id=41xrZ3uGuI
- Structured Context Engineering for File-Native Agentic Systems (2026): schema accuracy + multi-file navigation.
- Reflexion (2023): language agents that self-improve via verbal feedback.
- MemGPT (2023): explicit external memory management for LLM agents.
- A-Mem: Agentic Memory for LLM Agents (2025).
- Voyager (2023): lifelong skill library + curriculum in an open-ended environment.

The work below on test-time compute and reasoning budgets is relevant because “agentic coding” often fails not for lack of capability, but because the loop burns too much time and too many tokens getting to the right state.
- Scaling LLM Test-Time Compute Optimally… (2024).
- Token-Budget-Aware LLM Reasoning (ACL Findings 2025).
  - Paper (PDF): https://aclanthology.org/2025.findings-acl.1274.pdf
- Steering LLM Thinking with Budget Guidance (2025).
- ThinkPrune: pruning long chain-of-thought.
- The Impact of AI on Developer Productivity: Evidence from GitHub Copilot (2023).
- GitHub’s own writeup of productivity/happiness study results:

A recurring theme in community discussion is that assistants shift the bottleneck from writing to reviewing / understanding (especially for large diffs). We want our metrics to reflect that reality.
- Towards Decoding Developer Cognition in the Age of AI Assistants (2025).
- Understanding user mental models in AI-driven code completion tools (2025).
- Human-AI Experience in Integrated Development Environments: A systematic literature review (2025).

We also track practitioner discussion to understand what actually bottlenecks engineering teams when they adopt these tools.
- SWE-agent open source (benchmark realism, “bug report quality” debates):
- Devin announcement (scope limits, “90% correct is not good enough”):
- Aider thread (API cost, large-task iteration loops):
- Claude Code (multi-session orchestration, workflow discussion):
- Claude Code velocity discussion: bottleneck shift from typing to understanding/review.
- AI coding assistant comparisons (Cursor/Aider/Cline/Copilot, etc.):

If you think a paper/thread belongs here (especially anything that quantifies cost/time or failure modes), please open a PR adding it.