Skip to content

fix(#22): fall back to abstract-only when ar5iv has no full-text render#26

Merged
lucianfialho merged 2 commits into
p7dotorg:mainfrom
Vigtu:fix/issue-22-ar5iv-fallback
May 22, 2026
Merged

fix(#22): fall back to abstract-only when ar5iv has no full-text render#26
lucianfialho merged 2 commits into
p7dotorg:mainfrom
Vigtu:fix/issue-22-ar5iv-fallback

Conversation

@Vigtu

@Vigtu Vigtu commented May 15, 2026

Copy link
Copy Markdown
Contributor

Summary

  • When ar5iv.getHtml(id) fails with Ar5ivDecodeError (the stub-without-<article> case) or Ar5ivHttpError (e.g., 404), fetchAr5ivHtmlOrFallback swallows the error, logs a warning via Effect.logWarning, and the fetch chain produces abstract-only markdown from the already-fetched arxiv metadata.
  • Ar5ivTimeoutError and Ar5ivTransientError keep their existing fail-loud behaviour.
  • Reduced markdown is cached normally; user re-runs paper7 kb ingest <id> when they want to retry ar5iv.

Why

paper7 kb ingest 2306.06891 (the issue example, a Findings ACL 2023 paper) hits ar5iv, gets a 200 response with an 8KB "Untitled Document" stub, fails the article-tag check, and abandons the ingest — even though arxiv.get already returned the full title/authors/abstract. There is no path through the current code to land that paper in cache without manually downloading the PDF. With this PR the paper lands as a reduced (abstract-only) cache entry and the user keeps moving.

Design note: ACL Anthology branch dropped

The original spec proposed an ar5iv → ACL Anthology HTML → abstract-only chain via a DBLP lookup. Empirical probing during implementation showed DBLP indexes the arxiv preprint and the published ACL version under different keys — searching by arxiv id (q=2306.06891) returns only the CoRR record, not the conference record. The ACL branch would need a two-step lookup (arxiv id → title → ACL) with marginal coverage gain over abstract-only (which already gives the user a working result). Skipping it keeps the fallback chain to one branch and removes an external dependency on DBLP.

Fallback truth table

ar5iv failure mode Behaviour
Ar5ivDecodeError (no <article> tag) warn → abstract-only
Ar5ivHttpError (e.g., 404) warn → abstract-only
Ar5ivTimeoutError fail with GetAr5ivError (user retries)
Ar5ivTransientError fail with GetAr5ivError (already retries internally)
ar5iv returns valid HTML canonical markdown (no regression)

Test plan

  • npx vitest run → 171 pass / 0 fail.
  • npx tsc --noEmit clean.
  • New tests: falls back to abstract-only when ar5iv has no full-text render covers both Ar5ivDecodeError and Ar5ivHttpError paths via the rootCommand integration; does not fall back when ar5iv times out confirms timeout still surfaces; new yieldable failure test confirms Ar5ivTimeoutError still wraps in GetAr5ivError.
  • Replaced the old yields GetAr5ivError with nested Ar5ivDecodeError test (which asserted the now-removed behaviour) with falls back to abstract-only on Ar5ivDecodeError.

Closes #22.

…ext render

ar5iv returns HTTP 200 with a 'Untitled Document' stub (no <article>
tag) for papers it never rendered — ACL findings short papers and some
older arxiv entries are the recurring cases. Previously this bubbled
up as Ar5ivDecodeError and killed kb ingest terminally, even though
arxiv metadata (title, authors, abstract) was already fetched
successfully.

When the ar5iv branch fails with Ar5ivDecodeError or Ar5ivHttpError,
the fetch chain now silently switches to abstract-only markdown built
from the already-fetched arxiv metadata, logs a warning, and caches
the reduced result. The paper still lands in cache; the user can
re-run kb ingest later if ar5iv catches up.

Network-flavoured ar5iv errors (Ar5ivTimeoutError, Ar5ivTransientError)
keep their existing fail-loud behaviour — the assumption there is
'try again later', not 'this paper has no render'.

Dropped the DBLP-based ACL Anthology fallback that the original spec
proposed: DBLP indexes the arxiv preprint and the published ACL
version under different keys, and a single search by arxiv id returns
only the CoRR record. The ACL branch would have needed a two-step
lookup with marginal coverage gain — empirically not worth the
fragility. Abstract-only is the safe lowest common denominator.

Closes p7dotorg#22.

@EduSantosBrito EduSantosBrito left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Effect-idiomatic review: this one looks good overall.

This PR is heading in the right direction for Effect: it recovers only from the ar5iv failures that are semantically recoverable (Ar5ivDecodeError / Ar5ivHttpError) and leaves timeout/transient failures in the error channel. That distinction is exactly the kind of typed-error boundary Effect is good at expressing.

A small cleanup I would suggest, mostly to make the intent clearer for future readers:

Ar5ivClient.use((ar5iv) => ar5iv.getHtml(id)).pipe(
  Effect.catchTags({
    Ar5ivDecodeError: () => fallback,
    Ar5ivHttpError: () => fallback
  }),
  Effect.mapError((error) => new GetAr5ivError({ error }))
)

Why: Effect.catchTags documents at the type level that only those tags are recovered. Any unlisted ar5iv error remains in the error channel, so the compiler and reader both see that timeout/transient errors still fail loud.

Also consider returning Option.Option<string> instead of string | undefined from fetchAr5ivHtmlOrFallback. Option is the more idiomatic Effect representation for "the full text may be absent", while undefined is just a JS sentinel. Not required for this PR, but it would make the fallback path more explicit.

Verification: I checked this PR locally with tsc --noEmit and the get tests; they pass.

@EduSantosBrito

Copy link
Copy Markdown
Collaborator

Follow-up: small Effect API suggestions for this PR.

  1. Effect.catchTags

The current behavior is right. catchTags would just encode the intent more directly:

Ar5ivClient.use((ar5iv) => ar5iv.getHtml(id)).pipe(
  Effect.catchTags({
    Ar5ivDecodeError: () => fallbackToAbstractOnly(id),
    Ar5ivHttpError: () => fallbackToAbstractOnly(id)
  }),
  Effect.mapError((error) => new GetAr5ivError({ error }))
)

This makes it obvious that timeout/transient errors are not recovered.

  1. Option

Option.Option<string> would communicate "full text may be unavailable" better than string | undefined:

const fetchAr5ivHtmlOrFallback = (...): Effect.Effect<Option.Option<string>, GetError, Ar5ivClient>

Then the markdown builder can Option.match between abstract-only and canonical markdown.

  1. Optional: Effect.fn

fetchAr5ivHtmlOrFallback is a reusable business operation and a nice candidate for Effect.fn("fetchAr5ivHtmlOrFallback"). That gives better tracing/stack context as this flow grows.

Replace broad Effect.catch with Effect.catchTags so only Ar5ivDecodeError
and Ar5ivHttpError trigger the abstract-only fallback. Timeout and
transient errors stay in the typed error channel as before, now visible
to the compiler and the reader.
@Vigtu

Vigtu commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

Addressed the catchTags suggestion in 996cb36Effect.catchTags({ Ar5ivDecodeError, Ar5ivHttpError }) so timeout/transient ar5iv errors stay typed in the error channel.

Skipped the optional Option<string> change to keep the diff minimal per the "not required for this PR" note; happy to follow up separately if you'd like it.

@Vigtu Vigtu requested a review from EduSantosBrito May 21, 2026 21:33
@lucianfialho lucianfialho merged commit d622521 into p7dotorg:main May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: fallback to arxiv PDF / ACL Anthology when ar5iv has no HTML render

3 participants