feat(#23): kb ingest accepts multiple ids and prints a batch summary #27
paper7 kb ingest used to take exactly one identifier. Running 10-15
ingests in a research session meant launching the command per paper,
then diffing paper7 kb list against the expected id list to find the
failures, then retrying by hand.
The argument is now variadic via Argument.variadic({ min: 1 }). Single-id
behaviour is preserved exactly — the paper's markdown still streams to
stdout, so existing pipes keep working. With two or more ids, a new
runKbIngestBatch path takes over: it ingests serially (arxiv enforces a
~3s rate limit and S2 caps at ~1 req/s on the unauth tier; concurrency
buys 429s rather than throughput), and prints one summary block:
Ingested: N/M papers to <sources-dir>
Failed:
<id> — <reason>
Parse failures, network errors, and cache errors all land in the
Failed: list with a per-id reason. The batch exits 0 as long as at
least one paper landed; if every id failed the new KbIngestBatchFailed
error fires and the process exits 1 with 'error: all kb ingests failed'
on stderr while the summary still goes to stdout.
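The summary format and exit rule described above can be sketched in plain TypeScript. This is a minimal illustration, not the PR's code: the Attempt shape, the renderBatchSummary signature, the exitCodeFor helper, the sample ids, and the choice to omit the Failed: list when nothing failed are all assumptions made for the example.

```typescript
// Hypothetical attempt shape for illustration; the PR's real types may differ.
type Attempt =
  | { readonly kind: "ok"; readonly raw: string }
  | { readonly kind: "fail"; readonly raw: string; readonly reason: string };

type FailedAttempt = Extract<Attempt, { kind: "fail" }>;

// Build the summary block: a tally line, then a Failed: list when any id failed.
// (Omitting the list when there are no failures is an assumption.)
function renderBatchSummary(
  attempts: ReadonlyArray<Attempt>,
  sourcesDir: string
): string {
  const okCount = attempts.filter((a) => a.kind === "ok").length;
  const failed = attempts.filter((a): a is FailedAttempt => a.kind === "fail");
  const lines = [`Ingested: ${okCount}/${attempts.length} papers to ${sourcesDir}`];
  if (failed.length > 0) {
    lines.push("Failed:");
    for (const f of failed) {
      lines.push(`${f.raw} — ${f.reason}`);
    }
  }
  return lines.join("\n");
}

// Exit 0 when at least one paper landed, 1 when every id failed.
function exitCodeFor(attempts: ReadonlyArray<Attempt>): 0 | 1 {
  return attempts.some((a) => a.kind === "ok") ? 0 : 1;
}

// Placeholder ids and directory, not taken from the PR.
const sample: ReadonlyArray<Attempt> = [
  { kind: "ok", raw: "2401.00001" },
  { kind: "fail", raw: "bogus.id", reason: "invalid identifier" },
];
console.log(renderBatchSummary(sample, "/tmp/sources"));
```

Keeping the exit decision separate from rendering means the summary can always be printed to stdout, whichever exit code follows.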
The renderer is intentionally terse — soft fallbacks from PR3
(ar5iv → abstract-only) print their own warnings via Effect.logWarning
during ingest and count toward Ingested:, so the summary just reports
the final tally.
Closes p7dotorg#23.
EduSantosBrito
left a comment
Effect-idiomatic review: requesting changes.
The feature is useful, and keeping the batch sequential is a reasonable default. Effect's forEach defaults to concurrency 1, so the implementation is sequential as intended. The main changes I would ask for are about preserving Effect boundaries.
Why this matters in Effect: an Effect program should keep domain work, typed failures, and runtime/CLI concerns separate. That is what makes the code composable, testable, and easy to reason about. If a domain function prints to the console and catches every error as a string, it becomes much harder to reuse or inspect with Effect's typed model.
Concrete concerns:
runKbIngestBatch logs from src/kb.ts:
    yield* Console.log(renderBatchSummary(attempts, paths.sources))
Existing runKb returns a string and commands/kb.ts handles Console.log. I would keep that boundary: have runKbIngestBatch return a KbIngestBatchResult, then render/log it in src/commands/kb.ts. The domain module should describe the batch result; the command adapter should decide how to print it.
ingestOneForBatch catches every KbError | GetError and converts it to { kind: "fail", reason: string }.
    Effect.catch((error) =>
      Effect.succeed({ kind: "fail", raw, reason: batchErrorReason(error) })
    )
That erases the typed error channel. In Effect, broad catches are best kept at clear boundaries. Here, only errors that are genuinely per-id/recoverable should become failed entries. Shared or systemic failures should remain typed failures. For example, a write/permission problem in the wiki directory is probably not "one id failed"; it may mean the whole batch cannot write.
Suggested shape:
- Parse failures can become per-id failures.
- Expected per-id Get* failures can become per-id failures if that is the desired UX.
- Keep the typed error in the BatchAttempt payload as long as possible, and render it only in the CLI layer.
- Let systemic KbIoError / setup failures fail the command normally.
For example:
    type BatchAttempt =
      | { readonly _tag: "Ingested"; readonly raw: string }
      | { readonly _tag: "Failed"; readonly raw: string; readonly error: BatchIngestError }
Then commands/kb.ts can render the summary and format BatchIngestError for humans.
This keeps the nice batch UX while preserving the main Effect benefit: typed, explicit recoverability instead of catch-all stringification.
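To make the "render only in the CLI layer" idea concrete, here is a small dependency-free sketch. The tag names come from this thread, but the fields on each error variant, the formatBatchIngestError and renderFailedLines helpers, and the message wording are hypothetical and chosen only for illustration.

```typescript
// Tag names follow the review thread; the payload fields are assumed.
type BatchIngestError =
  | { readonly _tag: "KbInvalidIdentifier"; readonly raw: string }
  | { readonly _tag: "GetArxivError"; readonly message: string }
  | { readonly _tag: "GetCrossrefError"; readonly status: number };

type BatchAttempt =
  | { readonly _tag: "Ingested"; readonly raw: string }
  | { readonly _tag: "Failed"; readonly raw: string; readonly error: BatchIngestError };

// CLI-layer formatter: the only place a typed error becomes a string.
// The switch is exhaustive, so adding a new tag is a compile error until handled.
function formatBatchIngestError(error: BatchIngestError): string {
  switch (error._tag) {
    case "KbInvalidIdentifier":
      return "invalid identifier";
    case "GetArxivError":
      return `arxiv fetch failed: ${error.message}`;
    case "GetCrossrefError":
      return `crossref fetch failed (HTTP ${error.status})`;
  }
}

// Render one line per failed attempt, leaving successful ones out.
function renderFailedLines(attempts: ReadonlyArray<BatchAttempt>): string[] {
  const lines: string[] = [];
  for (const attempt of attempts) {
    if (attempt._tag === "Failed") {
      lines.push(`${attempt.raw} — ${formatBatchIngestError(attempt.error)}`);
    }
  }
  return lines;
}

// Placeholder data, not taken from the PR.
const attempts: ReadonlyArray<BatchAttempt> = [
  { _tag: "Ingested", raw: "2401.00001" },
  { _tag: "Failed", raw: "bogus.id", error: { _tag: "KbInvalidIdentifier", raw: "bogus.id" } },
  { _tag: "Failed", raw: "10.1000/xyz", error: { _tag: "GetCrossrefError", status: 503 } },
];
console.log(renderFailedLines(attempts).join("\n"));
```

Because the attempt keeps the structured error, tests and other callers can inspect the tag directly instead of matching on message text.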
Verification: I checked this PR locally with tsc --noEmit and the kb tests; they pass.
Follow-up: underused Effect APIs that would help this batch implementation.
For batch work, Effect.result is often better than catching everything into a string. It lets each per-id attempt become success/failure data while preserving the typed error:
    const attempts = yield* Effect.forEach(rawIds, (raw) =>
      ingestOne(raw).pipe(Effect.result)
    )
Then the command layer can use …
If only some failures should be per-id failures, catch just those tags:
    ingest(id).pipe(
      Effect.map(() => ({ _tag: "Ingested", raw })),
      Effect.catchTags({
        GetArxivError: (error) => Effect.succeed({ _tag: "Failed", raw, error }),
        GetAr5ivError: (error) => Effect.succeed({ _tag: "Failed", raw, error }),
        GetPubmedError: (error) => Effect.succeed({ _tag: "Failed", raw, error }),
        GetCrossrefError: (error) => Effect.succeed({ _tag: "Failed", raw, error })
      })
    )
Leave systemic errors like …
Current sequential behavior is fine. If bounded concurrency is ever wanted, Effect.forEach takes it as an option:
    Effect.forEach(rawIds, ingestOneForBatch, { concurrency: 2 })
No custom queue needed unless the policy becomes more complex.
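The "catch only the per-id tags, let systemic errors escape" policy can also be shown without Effect. The sketch below is a plain async/await stand-in, not the PR's implementation: TaggedError, PER_ID_TAGS, and ingestAll are hypothetical names, and thrown exceptions play the role that Effect's typed error channel plays in the real code.

```typescript
// Hypothetical tagged error; real Effect code would use typed failures, not throw.
class TaggedError extends Error {
  readonly _tag: string;
  constructor(tag: string, message: string) {
    super(message);
    this._tag = tag;
  }
}

// Tags treated as recoverable per-id failures (names taken from the thread).
const PER_ID_TAGS = new Set<string>([
  "GetArxivError",
  "GetAr5ivError",
  "GetPubmedError",
  "GetCrossrefError",
]);

type Attempt =
  | { readonly _tag: "Ingested"; readonly raw: string }
  | { readonly _tag: "Failed"; readonly raw: string; readonly error: TaggedError };

// Ingest ids strictly one at a time. Per-id failures become data;
// anything else is rethrown and aborts the rest of the batch.
async function ingestAll(
  rawIds: ReadonlyArray<string>,
  ingest: (raw: string) => Promise<void>
): Promise<Attempt[]> {
  const attempts: Attempt[] = [];
  for (const raw of rawIds) {
    try {
      await ingest(raw);
      attempts.push({ _tag: "Ingested", raw });
    } catch (error) {
      if (error instanceof TaggedError && PER_ID_TAGS.has(error._tag)) {
        attempts.push({ _tag: "Failed", raw, error });
      } else {
        throw error; // systemic (for example an I/O failure): fail the whole batch
      }
    }
  }
  return attempts;
}
```

With this split, a fetch error for one paper shows up as a Failed entry, while something like a write-permission error stops the batch instead of being reported as a skipped paper.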
Pull rendering and the final fail-decision out of src/kb.ts so the domain module returns data and the CLI adapter decides how to present it. runKbIngestBatch now returns KbIngestBatchResult (attempts + sourcesDir); src/commands/kb.ts logs the summary, formats per-id errors, and raises KbIngestBatchFailed when every id failed.
Narrow the per-id error boundary with Effect.catchTags. Only the four external fetch failures (GetArxivError, GetAr5ivError, GetPubmedError, GetCrossrefError) are converted to per-id Failed entries; KbIoError and the rest of GetError stay in the typed error channel so a wiki write failure / disk-full / permission problem still fails the whole batch loudly instead of being silently reported as a skipped paper.
BatchAttempt now carries the typed BatchIngestError payload (new KbInvalidIdentifier tag covers unparseable raw ids), and the CLI renderer is the only place that stringifies it.
KbIngestBatchFailed is raised with bare 'yield* new KbIngestBatchFailed(...)' per repo convention.
Addressed both points in f78532a:
1. Domain / CLI boundary:
    type KbIngestBatchResult = {
      readonly attempts: ReadonlyArray<BatchAttempt>
      readonly sourcesDir: string
    }
2. Narrowed catchTags + typed payload: replaced the blanket …
Also dropped the …
Summary
- paper7 kb ingest argument is now variadic via Argument.variadic({ min: 1 }).
- With two or more ids, a new runKbIngestBatch path takes over: serial ingest, per-id catch, single summary block on stdout.
- New KbIngestBatchFailed tagged error so the process exits 1 when every id in a batch fails.
Why
Running 10–15 ingests in a research session means launching the command per paper, then diffing paper7 kb list against the expected id list to find the failures, then retrying by hand. Variadic + summary collapses three steps into one.
Batch output shape
Hard failures (unparseable identifier, network/HTTP error, cache write error) hit the Failed: block with a per-id reason. Soft fallbacks from PR3 (ar5iv → abstract-only) emit their own warnings via Effect.logWarning during ingest and count toward Ingested:, so the summary just reports the final tally.
Exit codes
Exit 0 when at least one paper landed. Exit 1 when every id failed, with error: all kb ingests failed on stderr.
Concurrency
Sequential, intentionally. The arxiv export API enforces ~3s/request rate limiting and S2 caps at ~1 req/s on the unauth tier, so concurrency buys 429s, not throughput. A --concurrency flag would be future work behind a retry strategy.
Test plan
- npx vitest run → 171 pass / 0 fail.
- npx tsc --noEmit clean.
- Two valid ids (Ingested: 2/2, no markdown in output), one valid + one bogus (Ingested: 1/2 + Failed: bogus.id — invalid identifier), all bogus (exit 1 + summary on stdout + error on stderr).
Closes #23.