Skip to content

feat(#23): kb ingest accepts multiple ids and prints a batch summary#27

Merged
lucianfialho merged 2 commits into
p7dotorg:mainfrom
Vigtu:feat/issue-23-batch-ingest
May 22, 2026
Merged

feat(#23): kb ingest accepts multiple ids and prints a batch summary#27
lucianfialho merged 2 commits into
p7dotorg:mainfrom
Vigtu:feat/issue-23-batch-ingest

Conversation

@Vigtu

@Vigtu Vigtu commented May 15, 2026

Copy link
Copy Markdown
Contributor

Summary

  • paper7 kb ingest argument is now variadic via Argument.variadic({ min: 1 }).
  • Single-id behaviour is unchanged: the paper's markdown still streams to stdout, preserving existing pipes.
  • With two or more ids, a new runKbIngestBatch path takes over: serial ingest, per-id catch, single summary block on stdout.
  • Adds KbIngestBatchFailed tagged error so the process exits 1 when every id in a batch fails.

Why

Running 10–15 ingests in a research session means launching the command per paper, then diffing paper7 kb list against the expected id list to find the failures, then retrying by hand. Variadic + summary collapses three steps into one.

Batch output shape

Ingested: N/M papers to <sources-dir>
Failed:
  <id> — <reason>

Hard failures (unparseable identifier, network/HTTP error, cache write error) hit the Failed: block with a per-id reason. Soft fallbacks from PR3 (ar5iv → abstract-only) emit their own warnings via Effect.logWarning during ingest and count toward Ingested:, so the summary just reports the final tally.

Exit codes

  • ≥1 paper ingested → exit 0, summary on stdout.
  • Zero ingested with at least one attempt → exit 1, summary on stdout, error: all kb ingests failed on stderr.

Concurrency

Sequential, intentionally. arxiv export API enforces ~3s/request rate limiting and S2 caps at ~1 req/s on the unauth tier — concurrency buys 429s, not throughput. A --concurrency flag would be future work behind a retry strategy.

Test plan

  • npx vitest run → 171 pass / 0 fail.
  • npx tsc --noEmit clean.
  • New tests cover: two valid ids (Ingested: 2/2, no markdown in output), one valid + one bogus (Ingested: 1/2 + Failed: bogus.id — invalid identifier), all bogus (exit 1 + summary on stdout + error on stderr).
  • Existing single-id test guards regression on the pipe-friendly path (paper markdown still on stdout).

Closes #23.

…summary

paper7 kb ingest used to take exactly one identifier. Running 10-15
ingests in a research session meant launching the command per paper,
then diffing paper7 kb list against the expected id list to find the
failures, then retrying by hand.

The argument is now variadic via Argument.variadic({ min: 1 }). Single-id
behaviour is preserved exactly — the paper's markdown still streams to
stdout, so existing pipes keep working. With two or more ids, a new
runKbIngestBatch path takes over: it ingests serially (arxiv enforces a
~3s rate limit and S2 caps at ~1 req/s on the unauth tier; concurrency
buys 429s rather than throughput), and prints one summary block:

  Ingested: N/M papers to <sources-dir>
  Failed:
    <id> — <reason>

Parse failures, network errors, and cache errors all land in the
Failed: list with a per-id reason. The batch exits 0 as long as at
least one paper landed; if every id failed the new KbIngestBatchFailed
error fires and the process exits 1 with 'error: all kb ingests failed'
on stderr while the summary still goes to stdout.

The renderer is intentionally terse — soft fallbacks from PR3
(ar5iv → abstract-only) print their own warnings via Effect.logWarning
during ingest and count toward Ingested:, so the summary just reports
the final tally.

Closes p7dotorg#23.

@EduSantosBrito EduSantosBrito left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Effect-idiomatic review: requesting changes.

The feature is useful, and keeping the batch sequential is a reasonable default. Effect's forEach defaults to concurrency 1, so the implementation is sequential as intended. The main changes I would ask for are about preserving Effect boundaries.

Why this matters in Effect: an Effect program should keep domain work, typed failures, and runtime/CLI concerns separate. That is what makes the code composable, testable, and easy to reason about. If a domain function prints to the console and catches every error as a string, it becomes much harder to reuse or inspect with Effect's typed model.

Concrete concerns:

  1. runKbIngestBatch logs from src/kb.ts:
yield* Console.log(renderBatchSummary(attempts, paths.sources))

Existing runKb returns a string and commands/kb.ts handles Console.log. I would keep that boundary: have runKbIngestBatch return a KbIngestBatchResult, then render/log it in src/commands/kb.ts. The domain module should describe the batch result; the command adapter should decide how to print it.

  1. ingestOneForBatch catches every KbError | GetError and converts it to { kind: "fail", reason: string }.
Effect.catch((error) =>
  Effect.succeed({ kind: "fail", raw, reason: batchErrorReason(error) })
)

That erases the typed error channel. In Effect, broad catches are best kept at clear boundaries. Here, only errors that are genuinely per-id/recoverable should become failed entries. Shared or systemic failures should remain typed failures. For example, a write/permission problem in the wiki directory is probably not "one id failed"; it may mean the whole batch cannot write.

Suggested shape:

  • Parse failures can become per-id failures.
  • Expected per-id Get* failures can become per-id failures if that is the desired UX.
  • Keep the typed error in the BatchAttempt payload as long as possible, and render it only in the CLI layer.
  • Let systemic KbIoError / setup failures fail the command normally.

For example:

type BatchAttempt =
  | { readonly _tag: "Ingested"; readonly raw: string }
  | { readonly _tag: "Failed"; readonly raw: string; readonly error: BatchIngestError }

Then commands/kb.ts can render the summary and format BatchIngestError for humans.

This keeps the nice batch UX while preserving the main Effect benefit: typed, explicit recoverability instead of catch-all stringification.

Verification: I checked this PR locally with tsc --noEmit and the kb tests; they pass.

@EduSantosBrito

Copy link
Copy Markdown
Collaborator

Follow-up: underused Effect APIs that would help this batch implementation.

  1. Effect.result / Result.match

For batch work, this is often better than catching everything into a string. It lets each per-id attempt become success/failure data while preserving the typed error:

const attempts = yield* Effect.forEach(rawIds, (raw) =>
  ingestOne(raw).pipe(Effect.result)
)

Then the command layer can use Result.match to render the summary. This keeps domain errors typed until the CLI boundary.

  1. Effect.catchTags

If only some failures should be per-id failures, catch just those tags:

ingest(id).pipe(
  Effect.map(() => ({ _tag: "Ingested", raw })),
  Effect.catchTags({
    GetArxivError: (error) => Effect.succeed({ _tag: "Failed", raw, error }),
    GetAr5ivError: (error) => Effect.succeed({ _tag: "Failed", raw, error }),
    GetPubmedError: (error) => Effect.succeed({ _tag: "Failed", raw, error }),
    GetCrossrefError: (error) => Effect.succeed({ _tag: "Failed", raw, error })
  })
)

Leave systemic errors like KbIoError in the error channel.

  1. Effect.fn

runKbIngestBatch / ingestOneForBatch are reusable workflow functions and good candidates for Effect.fn("runKbIngestBatch") or Effect.fn("ingestOneForBatch"). It improves trace names and makes it clear these are Effect business operations.

  1. Effect.forEach concurrency option, if needed later

Current sequential behavior is fine: Effect.forEach defaults to concurrency 1. If you later add controlled parallelism, the idiomatic way is:

Effect.forEach(rawIds, ingestOneForBatch, { concurrency: 2 })

No custom queue needed unless the policy becomes more complex.

@Vigtu Vigtu requested a review from EduSantosBrito May 21, 2026 21:36
Pull rendering and the final fail-decision out of src/kb.ts so the
domain module returns data and the CLI adapter decides how to present
it. runKbIngestBatch now returns KbIngestBatchResult (attempts +
sourcesDir); src/commands/kb.ts logs the summary, formats per-id
errors, and raises KbIngestBatchFailed when every id failed.

Narrow the per-id error boundary with Effect.catchTags. Only the four
external fetch failures (GetArxivError, GetAr5ivError, GetPubmedError,
GetCrossrefError) are converted to per-id Failed entries; KbIoError
and the rest of GetError stay in the typed error channel so a wiki
write failure / disk-full / permission problem still fails the whole
batch loudly instead of being silently reported as a skipped paper.

BatchAttempt now carries the typed BatchIngestError payload (new
KbInvalidIdentifier tag covers unparseable raw ids), and the CLI
renderer is the only place that stringifies it. KbIngestBatchFailed is
raised with bare 'yield* new KbIngestBatchFailed(...)' per repo
convention.
@Vigtu

Vigtu commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

Addressed both points in f78532a:

1. Domain / CLI boundaryrunKbIngestBatch no longer logs or decides exit. It now returns:

type KbIngestBatchResult = {
  readonly attempts: ReadonlyArray<BatchAttempt>
  readonly sourcesDir: string
}

src/commands/kb.ts is now the only place that calls Console.log, renders the summary, and raises KbIngestBatchFailed when no ids succeeded — same UX, but the domain module is reusable/testable in isolation.

2. Narrowed catchTags + typed payload — replaced the blanket Effect.catch with Effect.catchTags({ GetArxivError, GetAr5ivError, GetPubmedError, GetCrossrefError }). KbIoError (wiki dir write/perm), GetCacheWriteError, etc. now stay in the typed error channel so a systemic failure fails the whole batch instead of being reported as a skipped paper.

BatchAttempt carries the typed BatchIngestError (new KbInvalidIdentifier tag for unparseable raw ids); the CLI renderer is the only place that turns it into a display string.

Also dropped the Effect.fail(new KbIngestBatchFailed(...)) for bare yield* new KbIngestBatchFailed(...) to match the rest of the repo.

tsc --noEmit clean, all 171 tests pass.

@lucianfialho lucianfialho merged commit 50c5786 into p7dotorg:main May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: batch kb ingest with aggregate failure summary

3 participants