fix(#22): fall back to abstract-only when ar5iv has no full-text render#26
Conversation
…ext render ar5iv returns HTTP 200 with a 'Untitled Document' stub (no <article> tag) for papers it never rendered — ACL findings short papers and some older arxiv entries are the recurring cases. Previously this bubbled up as Ar5ivDecodeError and killed kb ingest terminally, even though arxiv metadata (title, authors, abstract) was already fetched successfully. When the ar5iv branch fails with Ar5ivDecodeError or Ar5ivHttpError, the fetch chain now silently switches to abstract-only markdown built from the already-fetched arxiv metadata, logs a warning, and caches the reduced result. The paper still lands in cache; the user can re-run kb ingest later if ar5iv catches up. Network-flavoured ar5iv errors (Ar5ivTimeoutError, Ar5ivTransientError) keep their existing fail-loud behaviour — the assumption there is 'try again later', not 'this paper has no render'. Dropped the DBLP-based ACL Anthology fallback that the original spec proposed: DBLP indexes the arxiv preprint and the published ACL version under different keys, and a single search by arxiv id returns only the CoRR record. The ACL branch would have needed a two-step lookup with marginal coverage gain — empirically not worth the fragility. Abstract-only is the safe lowest common denominator. Closes p7dotorg#22.
EduSantosBrito
left a comment
There was a problem hiding this comment.
Effect-idiomatic review: this one looks good overall.
This PR is heading in the right direction for Effect: it recovers only from the ar5iv failures that are semantically recoverable (Ar5ivDecodeError / Ar5ivHttpError) and leaves timeout/transient failures in the error channel. That distinction is exactly the kind of typed-error boundary Effect is good at expressing.
A small cleanup I would suggest, mostly to make the intent clearer for future readers:
Ar5ivClient.use((ar5iv) => ar5iv.getHtml(id)).pipe(
Effect.catchTags({
Ar5ivDecodeError: () => fallback,
Ar5ivHttpError: () => fallback
}),
Effect.mapError((error) => new GetAr5ivError({ error }))
)Why: Effect.catchTags documents at the type level that only those tags are recovered. Any unlisted ar5iv error remains in the error channel, so the compiler and reader both see that timeout/transient errors still fail loud.
Also consider returning Option.Option<string> instead of string | undefined from fetchAr5ivHtmlOrFallback. Option is the more idiomatic Effect representation for "the full text may be absent", while undefined is just a JS sentinel. Not required for this PR, but it would make the fallback path more explicit.
Verification: I checked this PR locally with tsc --noEmit and the get tests; they pass.
|
Follow-up: small Effect API suggestions for this PR.
The current behavior is right. Ar5ivClient.use((ar5iv) => ar5iv.getHtml(id)).pipe(
Effect.catchTags({
Ar5ivDecodeError: () => fallbackToAbstractOnly(id),
Ar5ivHttpError: () => fallbackToAbstractOnly(id)
}),
Effect.mapError((error) => new GetAr5ivError({ error }))
)This makes it obvious that timeout/transient errors are not recovered.
const fetchAr5ivHtmlOrFallback = (...): Effect.Effect<Option.Option<string>, GetError, Ar5ivClient>Then the markdown builder can
|
Replace broad Effect.catch with Effect.catchTags so only Ar5ivDecodeError and Ar5ivHttpError trigger the abstract-only fallback. Timeout and transient errors stay in the typed error channel as before, now visible to the compiler and the reader.
|
Addressed the catchTags suggestion in 996cb36 — Skipped the optional |
Summary
ar5iv.getHtml(id)fails withAr5ivDecodeError(the stub-without-<article>case) orAr5ivHttpError(e.g., 404),fetchAr5ivHtmlOrFallbackswallows the error, logs a warning viaEffect.logWarning, and the fetch chain produces abstract-only markdown from the already-fetched arxiv metadata.Ar5ivTimeoutErrorandAr5ivTransientErrorkeep their existing fail-loud behaviour.paper7 kb ingest <id>when they want to retry ar5iv.Why
paper7 kb ingest 2306.06891(the issue example, a Findings ACL 2023 paper) hits ar5iv, gets a 200 response with an 8KB "Untitled Document" stub, fails the article-tag check, and abandons the ingest — even thougharxiv.getalready returned the full title/authors/abstract. There is no path through the current code to land that paper in cache without manually downloading the PDF. With this PR the paper lands as a reduced (abstract-only) cache entry and the user keeps moving.Design note: ACL Anthology branch dropped
The original spec proposed an
ar5iv → ACL Anthology HTML → abstract-onlychain via a DBLP lookup. Empirical probing during implementation showed DBLP indexes the arxiv preprint and the published ACL version under different keys — searching by arxiv id (q=2306.06891) returns only the CoRR record, not the conference record. The ACL branch would need a two-step lookup (arxiv id → title → ACL) with marginal coverage gain over abstract-only (which already gives the user a working result). Skipping it keeps the fallback chain to one branch and removes an external dependency on DBLP.Fallback truth table
Ar5ivDecodeError(no<article>tag)Ar5ivHttpError(e.g., 404)Ar5ivTimeoutErrorGetAr5ivError(user retries)Ar5ivTransientErrorGetAr5ivError(already retries internally)Test plan
npx vitest run→ 171 pass / 0 fail.npx tsc --noEmitclean.falls back to abstract-only when ar5iv has no full-text rendercovers bothAr5ivDecodeErrorandAr5ivHttpErrorpaths via the rootCommand integration;does not fall back when ar5iv times outconfirms timeout still surfaces; new yieldable failure test confirmsAr5ivTimeoutErrorstill wraps inGetAr5ivError.yields GetAr5ivError with nested Ar5ivDecodeErrortest (which asserted the now-removed behaviour) withfalls back to abstract-only on Ar5ivDecodeError.Closes #22.