fix(#22): fall back to abstract-only when ar5iv has no full-text render by Vigtu · Pull Request #26 · p7dotorg/paper7

Vigtu · 2026-05-15T11:15:35Z

Summary

When ar5iv.getHtml(id) fails with Ar5ivDecodeError (the stub-without-<article> case) or Ar5ivHttpError (e.g., 404), fetchAr5ivHtmlOrFallback swallows the error, logs a warning via Effect.logWarning, and the fetch chain produces abstract-only markdown from the already-fetched arxiv metadata.
Ar5ivTimeoutError and Ar5ivTransientError keep their existing fail-loud behaviour.
Reduced markdown is cached normally; user re-runs paper7 kb ingest <id> when they want to retry ar5iv.

Why

paper7 kb ingest 2306.06891 (the issue example, a Findings ACL 2023 paper) hits ar5iv, gets a 200 response with an 8KB "Untitled Document" stub, fails the article-tag check, and abandons the ingest — even though arxiv.get already returned the full title/authors/abstract. There is no path through the current code to land that paper in cache without manually downloading the PDF. With this PR the paper lands as a reduced (abstract-only) cache entry and the user keeps moving.

Design note: ACL Anthology branch dropped

The original spec proposed an ar5iv → ACL Anthology HTML → abstract-only chain via a DBLP lookup. Empirical probing during implementation showed DBLP indexes the arxiv preprint and the published ACL version under different keys — searching by arxiv id (q=2306.06891) returns only the CoRR record, not the conference record. The ACL branch would need a two-step lookup (arxiv id → title → ACL) with marginal coverage gain over abstract-only (which already gives the user a working result). Skipping it keeps the fallback chain to one branch and removes an external dependency on DBLP.

Fallback truth table

ar5iv failure mode	Behaviour
`Ar5ivDecodeError` (no `<article>` tag)	warn → abstract-only
`Ar5ivHttpError` (e.g., 404)	warn → abstract-only
`Ar5ivTimeoutError`	fail with `GetAr5ivError` (user retries)
`Ar5ivTransientError`	fail with `GetAr5ivError` (already retries internally)
ar5iv returns valid HTML	canonical markdown (no regression)

Test plan

npx vitest run → 171 pass / 0 fail.
npx tsc --noEmit clean.
New tests: falls back to abstract-only when ar5iv has no full-text render covers both Ar5ivDecodeError and Ar5ivHttpError paths via the rootCommand integration; does not fall back when ar5iv times out confirms timeout still surfaces; new yieldable failure test confirms Ar5ivTimeoutError still wraps in GetAr5ivError.
Replaced the old yields GetAr5ivError with nested Ar5ivDecodeError test (which asserted the now-removed behaviour) with falls back to abstract-only on Ar5ivDecodeError.

Closes #22.

…ext render ar5iv returns HTTP 200 with a 'Untitled Document' stub (no <article> tag) for papers it never rendered — ACL findings short papers and some older arxiv entries are the recurring cases. Previously this bubbled up as Ar5ivDecodeError and killed kb ingest terminally, even though arxiv metadata (title, authors, abstract) was already fetched successfully. When the ar5iv branch fails with Ar5ivDecodeError or Ar5ivHttpError, the fetch chain now silently switches to abstract-only markdown built from the already-fetched arxiv metadata, logs a warning, and caches the reduced result. The paper still lands in cache; the user can re-run kb ingest later if ar5iv catches up. Network-flavoured ar5iv errors (Ar5ivTimeoutError, Ar5ivTransientError) keep their existing fail-loud behaviour — the assumption there is 'try again later', not 'this paper has no render'. Dropped the DBLP-based ACL Anthology fallback that the original spec proposed: DBLP indexes the arxiv preprint and the published ACL version under different keys, and a single search by arxiv id returns only the CoRR record. The ACL branch would have needed a two-step lookup with marginal coverage gain — empirically not worth the fragility. Abstract-only is the safe lowest common denominator. Closes p7dotorg#22.

EduSantosBrito

Effect-idiomatic review: this one looks good overall.

This PR is heading in the right direction for Effect: it recovers only from the ar5iv failures that are semantically recoverable (Ar5ivDecodeError / Ar5ivHttpError) and leaves timeout/transient failures in the error channel. That distinction is exactly the kind of typed-error boundary Effect is good at expressing.

A small cleanup I would suggest, mostly to make the intent clearer for future readers:

Ar5ivClient.use((ar5iv) => ar5iv.getHtml(id)).pipe(
  Effect.catchTags({
    Ar5ivDecodeError: () => fallback,
    Ar5ivHttpError: () => fallback
  }),
  Effect.mapError((error) => new GetAr5ivError({ error }))
)

Why: Effect.catchTags documents at the type level that only those tags are recovered. Any unlisted ar5iv error remains in the error channel, so the compiler and reader both see that timeout/transient errors still fail loud.

Also consider returning Option.Option<string> instead of string | undefined from fetchAr5ivHtmlOrFallback. Option is the more idiomatic Effect representation for "the full text may be absent", while undefined is just a JS sentinel. Not required for this PR, but it would make the fallback path more explicit.

Verification: I checked this PR locally with tsc --noEmit and the get tests; they pass.

EduSantosBrito · 2026-05-21T12:23:15Z

Follow-up: small Effect API suggestions for this PR.

Effect.catchTags

The current behavior is right. catchTags would just encode the intent more directly:

Ar5ivClient.use((ar5iv) => ar5iv.getHtml(id)).pipe(
  Effect.catchTags({
    Ar5ivDecodeError: () => fallbackToAbstractOnly(id),
    Ar5ivHttpError: () => fallbackToAbstractOnly(id)
  }),
  Effect.mapError((error) => new GetAr5ivError({ error }))
)

This makes it obvious that timeout/transient errors are not recovered.

Option

Option.Option<string> would communicate "full text may be unavailable" better than string | undefined:

const fetchAr5ivHtmlOrFallback = (...): Effect.Effect<Option.Option<string>, GetError, Ar5ivClient>

Then the markdown builder can Option.match between abstract-only and canonical markdown.

Optional: Effect.fn

fetchAr5ivHtmlOrFallback is a reusable business operation and a nice candidate for Effect.fn("fetchAr5ivHtmlOrFallback"). That gives better tracing/stack context as this flow grows.

Replace broad Effect.catch with Effect.catchTags so only Ar5ivDecodeError and Ar5ivHttpError trigger the abstract-only fallback. Timeout and transient errors stay in the typed error channel as before, now visible to the compiler and the reader.

Vigtu · 2026-05-21T21:32:09Z

Addressed the catchTags suggestion in 996cb36 — Effect.catchTags({ Ar5ivDecodeError, Ar5ivHttpError }) so timeout/transient ar5iv errors stay typed in the error channel.

Skipped the optional Option<string> change to keep the diff minimal per the "not required for this PR" note; happy to follow up separately if you'd like it.

EduSantosBrito reviewed May 21, 2026

View reviewed changes

Vigtu requested a review from EduSantosBrito May 21, 2026 21:33

EduSantosBrito approved these changes May 22, 2026

View reviewed changes

lucianfialho merged commit d622521 into p7dotorg:main May 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(#22): fall back to abstract-only when ar5iv has no full-text render#26

fix(#22): fall back to abstract-only when ar5iv has no full-text render#26
lucianfialho merged 2 commits into
p7dotorg:mainfrom
Vigtu:fix/issue-22-ar5iv-fallback

Vigtu commented May 15, 2026

Uh oh!

EduSantosBrito left a comment

Uh oh!

EduSantosBrito commented May 21, 2026

Uh oh!

Vigtu commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Vigtu commented May 15, 2026

Summary

Why

Design note: ACL Anthology branch dropped

Fallback truth table

Test plan

Uh oh!

EduSantosBrito left a comment

Choose a reason for hiding this comment

Uh oh!

EduSantosBrito commented May 21, 2026

Uh oh!

Vigtu commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants