Surface open questions in property catalog overview (#140)

SeanTAllen · web-flow · commit 0bf274837b10 · 2026-05-05T14:17:53.000-04:00
A per-property Open Questions list under each catalog entry lets reviewers
see at a glance which properties carry unresolved uncertainty, without
opening every evidence file. Property prose stays at the level of
generality the analysis supports; the list does the work of flagging
what's not yet pinned down. Discovery, evaluation, and workload flows
populate, sync, and read the list. The previous self-review criterion
required all open questions to be resolved — for projects with thin
documentation that's not realistic, so the criterion now requires
visibility plus an auditable investigation log instead.
diff --git a/antithesis-research/SKILL.md b/antithesis-research/SKILL.md
@@ -156,7 +156,8 @@ Review criteria:
 - Properties that need internal branch guidance or replay anchors call out likely SUT-side instrumentation points, not just workload-visible checks
 - `antithesis/scratchbook/deployment-topology.md` exists and describes a minimal container topology — every container is justified
 - Every cataloged property has a corresponding evidence file at `antithesis/scratchbook/properties/{slug}.md` that captures the evidence trail, relevant code paths, and key observations
-- Open questions in evidence files have been investigated and resolved — no unresolved questions remain in evidence files for active properties
+- Each property's catalog entry includes an Open Questions list that mirrors the unresolved questions from its evidence file (short summaries; full context stays in the evidence file). Resolved questions are removed; remaining questions are tagged per `references/property-catalog.md` ("Open Questions Conventions").
+- Each `(partial: ...)` or `(needs human input)` tag is backed by an investigation log entry in the evidence file (see `references/property-catalog.md` "Investigation Log") that records what was examined, what was found, and what wasn't. Untagged (not yet investigated) questions don't require a log entry. Unresolved questions that need human input are an acceptable outcome — they must be visible in the catalog overview, not buried in evidence files.
 - Properties invalidated by investigation are marked in the catalog with the reason
 - `antithesis/scratchbook/property-relationships.md` exists and groups related properties into clusters with brief notes on suspected connections and dominance
 - Every property slug referenced in `property-relationships.md` corresponds to a property in the catalog
diff --git a/antithesis-research/references/property-catalog.md b/antithesis-research/references/property-catalog.md
@@ -59,10 +59,70 @@ For each property, document:
 | **Invariant** | What Antithesis SDK assertion(s)? Be specific about what is checked and how. Include why that assertion type matches the property semantics. |
 | **Antithesis Angle** | How does fault injection interact with this? What timing/interleaving does it explore? |
 | **Why It Matters** | Real-world impact. Link to issues if applicable. |
+
+**Open Questions:**
+
+- Bullet list of unresolved questions about this property, or "None" if no questions remain. See "Open Questions Conventions" below for state tags and rules.
 ```
 
 For every property in the catalog, write a corresponding evidence file at `properties/{slug}.md`. See the "Evidence Files" section below.
 
+## Honest Summaries
+
+A property's catalog entry — its prose fields *and* its Open Questions list together — is the honest summary of what the analysis knows. The fields describe what the property is and why; the Open Questions list describes what's uncertain about it. Both parts do work. Don't overload one to compensate for the other.
+
+In particular, don't write a firm claim in **Property**, **Invariant**, **Antithesis Angle**, or **Why It Matters** that the Open Questions list contradicts. That's internally inconsistent, and a reviewer who notices will distrust the rest of the catalog. The fix is to write the prose at the level of generality the analysis actually supports — not to stuff parenthetical hedges into the field. If a key term in the property's prose has an unresolved meaning, the term can stay; the question goes in the Open Questions list and the reader knows to treat the term parametrically.
+
+The assertion type in **Invariant** (`Always`, `Sometimes(cond)`, etc.) is the one place where there's no parametric option: it's a discrete commitment. If you can't decide between two assertion types, the property isn't ready for the catalog — pick one (using "Choosing the Right Antithesis Assertion" below) or split into two properties.
+
+**Example.** The analysis identified an open question about what "acknowledged" means in this codebase but pinned down everything else about the property.
+
+Bad — the catalog entry hides the uncertainty. The reviewer sees a firm claim and assumes a specific definition of "acknowledged" that the analysis didn't actually verify:
+
+```
+| **Property** | Acknowledged writes survive leader failover. |
+
+**Open Questions:**
+
+- None
+```
+
+Good — same prose, but the Open Questions list does the honest work. The reader sees that "acknowledged" is contested and treats the property's claim parametrically:
+
+```
+| **Property** | Acknowledged writes survive leader failover. |
+
+**Open Questions:**
+
+- What does "acknowledged" mean in this codebase — fsync'd to leader, or replicated to quorum? `(needs human input)`
+```
+
+The Property prose is identical in both. The honest summary is the entry as a whole, including the list.
+
+## Open Questions Conventions
+
+The Open Questions list under each property gives a reviewer scanning the catalog at-a-glance signal for which properties have unresolved uncertainty. It complements the evidence file; it does not replace it.
+
+- **Summary, not verbatim:** each bullet is a short summary of one unresolved question. The full "why it matters / what changes depending on the answer" context lives in the evidence file (see "Evidence Files" below). The list points at the evidence file; it doesn't duplicate it.
+- **State tags:** tags describe how much investigation has happened — not what kind of question it is. Different kinds of questions deserve different follow-up (some need code reading, some need a stakeholder), but the list tracks status only.
+  - No tag — not yet investigated. This is the default at discovery output time.
+  - `(partial: <one-line summary>)` — investigation produced some findings but the core question remains open. Example: `(partial: confirmed write path uses fsync, replication path still unclear)`.
+  - `(needs human input)` — investigation completed and the question can't be resolved without information from a human (project owner, domain expert, etc.). The evidence file's investigation log records what was examined and what's missing.
+- **Exhaust investigation before tagging `(needs human input)`:** the tag means "I tried, the code/docs don't have the answer." The self-review criterion in `SKILL.md` requires the evidence file's investigation log to back this up. Skipping investigation and immediately tagging gets caught at review.
+- **Resolved questions drop out:** once a question is answered, remove it from the list. If the answer changes the property — different invariant, different assertion type, different scope — update the affected catalog fields accordingly. Even if the prose was already written parametrically (per "Honest Summaries"), the resolution often makes specific terms more precise; revise the prose if the new precision changes what the field claims.
+- **Priority is independent of open questions:** a high-priority property may have many open questions; a low-priority property may have none. Open Questions surface uncertainty about a property's claims; they don't lower its value. Don't deprioritize a property because its Open Questions list is busy.
+- **Per-property vs. catalog-wide:** this list is per-property — it holds questions about a specific property's claims. Catalog-wide questions (about the analysis itself, the deployment topology, the scope of testing) belong under the file-level `Open Questions` heading per `references/scratchbook-setup.md`.
+
+Example of a populated list (one bullet per state — untagged, partial, needs-human-input):
+
+```
+**Open Questions:**
+
+- Does the snapshot path retry on a partial write, or surface the error?
+- Is the post-restart index always monotonic? `(partial: confirmed for clean shutdown, crash path unclear)`
+- What does "acknowledged" mean in this codebase — fsync'd to leader, or replicated to quorum? `(needs human input)`
+```
+
 ## Choosing the Right Antithesis Assertion
 
 Antithesis assertions are not traditional crash-on-failure assertions. Failed Antithesis assertions do not terminate the program; they report property outcomes and guide exploration. That makes them safe to place in workload code and, when useful, production code paths. Outside Antithesis, SDKs are designed for minimal overhead and many languages support build-time or runtime disable modes.
@@ -119,6 +179,29 @@ When you encounter questions you can't answer within the scope of discovery, inc
 
 The goal is to preserve what you already know from your analysis. Don't do additional deep research for the evidence file — capture what you learned during discovery.
 
+### Investigation Log
+
+When an open question is investigated (per "Investigate Open Questions" in `references/property-discovery.md`), record what was examined under an `### Investigation Log` heading at the bottom of the evidence file. Use one `####` sub-heading per question (verbatim or close-paraphrase of the question text), with the body recording what was examined, found, and not found:
+
+```
+### Investigation Log
+
+#### What does "acknowledged" mean — fsync'd to leader, or replicated to quorum?
+
+- Examined: `internal/replication/leader.go`, `internal/wal/sync.go`, README "Durability" section, issue #482.
+- Found: leader writes to local WAL with fsync before responding; replication is async by default.
+- Not found: whether `--sync-replicate` flag changes the response point. The flag is referenced in tests but its semantics aren't documented.
+- Conclusion: tagged `(needs human input)` — owner needs to confirm intended behavior under `--sync-replicate`.
+```
+
+A few rules for the log:
+
+- **Persistence.** Log entries stay in the evidence file after the question resolves — they are audit trail, not working memory. They prove the criterion was met. A future reviewer or a re-investigation pass can read them.
+- **Multi-pass investigation.** If a question is re-investigated (e.g., during evaluation refinement or post-triage iteration), append a new section under the same `####` heading rather than rewriting. Date or context the new entry so the sequence is clear.
+- **Required vs. illustrative fields.** The Examined / Found / Not found / Conclusion structure is the recommended template; deviate when the question doesn't fit. The reviewer's bar is "I can tell what was tried and what was missing," not "I can tick four labeled fields."
+
+This makes the "attempted" check in `SKILL.md` self-review verifiable: a reviewer can confirm that each `(partial: ...)` or `(needs human input)` tag has a real investigation behind it.
+
 ## Output
 
 Write the catalog to `antithesis/scratchbook/property-catalog.md`. Include the provenance frontmatter (see "Provenance" section above).
diff --git a/antithesis-research/references/property-discovery.md b/antithesis-research/references/property-discovery.md
@@ -178,7 +178,18 @@ already exists in the codebase, is partially present, or is missing.
 
 **Returned to synthesis:**
 
-- Properties in catalog format (may be empty)
+- Properties in catalog format (may be empty). Populate the Open Questions list
+  directly from the questions written into the evidence file, following
+  `references/property-catalog.md` ("Open Questions Conventions"). At this
+  stage, questions are typically untagged — they haven't been investigated
+  yet. The synthesis "Investigate Open Questions" step adds tags.
+- If a question you raised would contradict a prose claim you'd otherwise
+  write in **Property**, **Invariant**, **Antithesis Angle**, or **Why It
+  Matters**, write the prose at the level of generality the analysis
+  supports — don't overcommit to an answer the Open Questions list says is
+  uncertain. The list does the work of surfacing uncertainty; the prose
+  doesn't need parenthetical hedges. See `references/property-catalog.md`
+  ("Honest Summaries").
 - For each property, the evidence file slug and a brief rationale summary —
   enough for synthesis to deduplicate, resolve conflicts, and evaluate assertion
   type, but not the full evidence narrative
@@ -212,21 +223,24 @@ After all agents complete, synthesize into a single property catalog:
   evidence, code paths, or failure mechanisms into clusters. Note any suspected
   dominance (where one property likely implies another). This is lightweight —
   flag connections you noticed during synthesis, don't do deep analysis.
-- **Investigate open questions:** For each property that has open questions in
-  its evidence file, spawn an agent to investigate. Run these in parallel — one
-  agent per property with open questions. Each agent receives the path to the
-  evidence file in the scratchbook, codebase access, and the path to
-  `antithesis/scratchbook/existing-assertions.md`. The agent reads the evidence
-  file's explanation of why each question matters, investigates the code to
-  answer it, and writes the updated evidence file back to the same path with
-  questions resolved — including correcting any instrumentation suggestions
-  against `existing-assertions.md` to accurately reflect what already exists vs.
-  what is missing. The agent returns a short structured summary (not the full
-  evidence file) describing: which questions were resolved, whether the property
-  changed (different invariant, different assertion type), and whether the
-  property was invalidated with the reason. After all agents return, review the
-  summaries. If any property was invalidated or changed, update the catalog and
-  re-evaluate property relationships accordingly.
+- **Investigate open questions:** see "Investigate Open Questions" below.
+
+#### Investigate Open Questions
+
+For each property that has open questions in its evidence file, spawn an agent to investigate. Run these in parallel — one agent per property with open questions.
+
+Each agent receives the path to the evidence file, codebase access, and the path to `antithesis/scratchbook/existing-assertions.md`.
+
+Each agent must:
+
+- Read the evidence file's explanation of why each question matters and investigate the code to answer it.
+- Not fabricate answers. If a question can't be resolved from code, docs, or other available evidence, leave it open with partial findings noted. Tag `(partial: ...)` when there are partial findings; tag `(needs human input)` only after exhausting investigation against code and docs. Some questions legitimately need human input — that is a normal outcome, not a failure of the step.
+- Record an investigation log in the evidence file under an `### Investigation Log` heading (see `references/property-catalog.md` "Investigation Log") so the "attempted" check in `SKILL.md` self-review is auditable.
+- Update the evidence file: remove resolved questions, note partial findings, apply human-input tags. Correct any instrumentation suggestions against `existing-assertions.md` to reflect what already exists vs. what is missing.
+
+Each agent returns a short structured summary (not the full evidence file) describing: which questions were resolved, which remain (and in what state), whether the property changed (different invariant, different assertion type), and whether the property was invalidated with the reason.
+
+After all agents return, review the summaries. Update the catalog whenever a property is invalidated, changed, or has remaining unresolved questions — sync the Open Questions list under each affected property from the evidence file, following `references/property-catalog.md` ("Open Questions Conventions"). If a resolution makes a parametric term in the prose more specific, refine the prose field accordingly. Re-evaluate property relationships if any property changed or was invalidated.
 
 ## Single-Agent Mode
 
@@ -258,12 +272,24 @@ focuses as a sequential checklist:
    lightweight — flag connections you noticed during the passes, don't do deep
    analysis.
 9. For each property with open questions in its evidence file, investigate the
-   code to answer them. Read the evidence file's explanation of why each
-   question matters, trace the code to answer it, and update the evidence file
-   with the answer and its implications — including correcting any instrumentation
-   suggestions against `existing-assertions.md`. If the answer changes the
-   property or invalidates it, update accordingly. Mark invalidated properties
-   in the catalog with the reason.
+   code to answer them:
+
+   - Read the evidence file's explanation of why each question matters and
+     trace the code to answer it.
+   - Don't fabricate answers. Tag `(partial: ...)` when partial findings exist;
+     tag `(needs human input)` only after exhausting investigation against code
+     and docs.
+   - Record an investigation log in the evidence file under an `### Investigation Log` heading (see `references/property-catalog.md` "Investigation Log") so
+     the "attempted" check in `SKILL.md` self-review is auditable.
+   - Update the evidence file with resolved answers and their implications,
+     including correcting any instrumentation suggestions against
+     `existing-assertions.md`.
+   - Sync the Open Questions list under each property from the evidence file,
+     following `references/property-catalog.md` ("Open Questions Conventions").
+     If a resolution makes a parametric term in the prose more specific,
+     refine the prose field accordingly.
+   - If an answer changes the property or invalidates it, update accordingly.
+     Mark invalidated properties in the catalog with the reason.
 
 Treat each pass (1–10) as a fresh examination. Resist the pull to skip a focus
 because earlier passes "already covered" that area — the point is to look at the
diff --git a/antithesis-research/references/property-evaluation.md b/antithesis-research/references/property-evaluation.md
@@ -234,7 +234,10 @@ Include each finding, its category, and the action to be taken.
 After categorization:
 
 1. **Refinements**: Apply directly. Update the affected properties in the
-   catalog, their evidence files, and property relationships if needed.
+   catalog, their evidence files, and property relationships if needed. When a
+   refinement modifies a property, ensure the Open Questions list under the
+   property stays in sync with its evidence file per
+   `references/property-catalog.md` ("Open Questions Conventions").
 
 2. **Gaps**: Spawn targeted discovery agents to fill them. Each agent receives:
    - The gap description from evaluation (used as a targeted attention focus)
@@ -246,9 +249,13 @@ After categorization:
    - The property catalog format from `references/property-catalog.md`
 
    Each agent returns properties in catalog format and writes evidence files.
-   Synthesize gap-fill properties into the catalog using the same
-   deduplication and merge process from property discovery synthesis. Update
-   property relationships with any new clusters.
+   Targeted discovery agents may carry forward unresolved questions in their
+   new evidence files; they populate the Open Questions list per
+   `references/property-catalog.md` ("Open Questions Conventions"). Synthesize
+   gap-fill properties into the catalog using the same deduplication and merge
+   process from property discovery synthesis; this includes reconciling the
+   Open Questions list under each new property. Update property relationships
+   with any new clusters.
 
 3. **Biases**: Collect and present to the human with the evaluation evidence.
    Include: the bias finding, the evidence from the evaluation agent(s), and
diff --git a/antithesis-workload/references/assertions.md b/antithesis-workload/references/assertions.md
diff --git a/antithesis-workload/references/iteration.md b/antithesis-workload/references/iteration.md