[recipes] URL Batch Import#339

Open: kae36 wants to merge 5 commits into NateBJones-Projects:main from kae36:contrib/kae36/url-batch-import

kae36 commented Jun 1, 2026

Contribution Type

  • Recipe (/recipes)

What does this do?

Adds a url-batch-import recipe: read a list of URLs (.txt or .csv), fetch and extract each page, generate an LLM summary + metadata via OpenRouter, and import each as a searchable thought in Open Brain with SHA-256 content-fingerprint dedup. One article = one thought; the full extracted text is kept in metadata.raw_text.

Requirements

  • A working Open Brain (Supabase + pgvector) setup
  • Deno runtime
  • OpenRouter API key (same one used in Open Brain setup)
  • Optional: the content-fingerprint-dedup primitive for the DB-level dedup column (linked in the README; the script degrades gracefully without it)

Checklist

  • I've read CONTRIBUTING.md
  • My contribution has a README.md with prerequisites, step-by-step instructions, and expected outcome
  • My metadata.json has all required fields
  • Dependency on the content-fingerprint-dedup primitive is linked in the README
  • I tested this on my own Open Brain instance
  • No credentials, API keys, or secrets are included

Notes

Built for resilient, repeatable imports:

  • Resumable sync-log with incremental persistence — an interrupted run no longer loses dedup progress (writes after every URL).
  • URL-stable fingerprint — hashes a normalized URL (strips utm_*/fbclid/etc., www, trailing slash, fragment) so the same article dedups reliably across runs, instead of hashing volatile date+summary content.
  • backfill-compute.ts helper to migrate existing rows to the URL-based fingerprint, reusing the importer's exact function.
  • Graceful handling of failed fetches (paywalls, 403s, JS SPAs) logged to failures.log.
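
The URL-stable fingerprint described above can be sketched roughly as follows. This is a minimal illustration, not the recipe's actual code: the function names `normalizeUrl` and `urlFingerprint` and the contents of `TRACKING_PARAMS` are assumptions, and the importer's real parameter list may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical tracking-parameter list; the PR only names utm_* and fbclid explicitly.
const TRACKING_PARAMS = new Set(["fbclid", "gclid", "mc_cid", "mc_eid"]);

// Reduce a URL to a stable form so the same article hashes identically across runs.
export function normalizeUrl(raw: string): string {
  const url = new URL(raw.trim());
  url.hash = ""; // drop the fragment
  url.hostname = url.hostname.replace(/^www\./, ""); // drop a leading www.
  // Copy the keys first so deleting does not disturb the iteration.
  for (const key of [...url.searchParams.keys()]) {
    const lower = key.toLowerCase();
    if (lower.startsWith("utm_") || TRACKING_PARAMS.has(lower)) {
      url.searchParams.delete(key);
    }
  }
  // Strip trailing slashes, but leave the root path "/" alone.
  if (url.pathname.length > 1) {
    url.pathname = url.pathname.replace(/\/+$/, "");
  }
  return url.toString();
}

// SHA-256 hex digest of the normalized URL.
export function urlFingerprint(raw: string): string {
  return createHash("sha256").update(normalizeUrl(raw)).digest("hex");
}
```

Hashing the normalized URL rather than the summary text means two runs over the same link produce the same fingerprint even if the LLM summary or fetch date changes between runs.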

🤖 Generated with Claude Code

Fetch a list of URLs (news, blog posts, web pages), summarize each with an
LLM via OpenRouter, and import them as searchable thoughts in Open Brain
with SHA-256 content-fingerprint dedup.

- Incremental sync-log persistence so an interrupted run doesn't lose
  dedup progress (writes after every URL, not just at the end)
- URL-stable fingerprint: hashes a normalized URL (strips utm_* and other
  tracking params, www, trailing slash, fragment) for reliable cross-run
  deduplication instead of volatile date+summary content
- backfill-compute.ts helper that reuses the importer's exact fingerprint
  function to migrate existing rows
- import.meta.main guard so the module is importable for testing/reuse
- sample-urls.csv and test-urls.txt examples; recipe-local .gitignore
  keeps .env, *.log, sync-log.json, and personal *.csv lists out of git

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
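
The incremental sync-log persistence mentioned in this commit message might look something like the sketch below. The `SyncLog` shape and the names `loadSyncLog`, `recordImported`, and `runImport` are hypothetical; only the behavior (persist after every URL, not just at the end) comes from the PR description. The write-to-temp-then-rename step is an added assumption to avoid a half-written log if the process is killed mid-write.

```typescript
import { existsSync, readFileSync, renameSync, writeFileSync } from "node:fs";

// Hypothetical on-disk shape: a flat list of fingerprints already imported.
type SyncLog = { fingerprints: string[] };

// Load the log, or start empty on the first run.
export function loadSyncLog(path: string): SyncLog {
  if (!existsSync(path)) return { fingerprints: [] };
  return JSON.parse(readFileSync(path, "utf8")) as SyncLog;
}

// Record one fingerprint and persist immediately.
export function recordImported(path: string, log: SyncLog, fp: string): void {
  log.fingerprints.push(fp);
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(log, null, 2));
  renameSync(tmp, path); // replace the old log in one step
}

// Import loop: skip anything already logged, persist after every success.
export async function runImport(
  urls: string[],
  logPath: string,
  fingerprint: (url: string) => string,
  importOne: (url: string) => Promise<void>,
): Promise<number> {
  const log = loadSyncLog(logPath);
  const seen = new Set(log.fingerprints);
  let imported = 0;
  for (const url of urls) {
    const fp = fingerprint(url);
    if (seen.has(fp)) continue;
    await importOne(url);
    seen.add(fp);
    recordImported(logPath, log, fp);
    imported++;
  }
  return imported;
}
```

If a run is interrupted after the third URL, the next run reloads the three saved fingerprints and resumes at the fourth.
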
github-actions bot added the "recipe" label (Contribution: step-by-step recipe) Jun 1, 2026

github-actions Bot commented Jun 1, 2026

Hey @kae36 — welcome to Open Brain Source! 👋

Thanks for submitting your first PR. The automated review will run shortly and check things like metadata, folder structure, and README completeness. If anything needs fixing, the review comment will tell you exactly what.

Once the automated checks pass, a human admin will review for quality and clarity. Expect a response within a few days.

If you have questions, check out CONTRIBUTING.md or open an issue.

Surround fenced code blocks with blank lines so the repo's Markdown Lint
check (markdownlint-cli2, .github/.markdownlint.jsonc) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kae36 commented Jun 1, 2026

Author

Heads-up for reviewers 👋 — the ob1-gate check showing as failed on this PR appears to be a repo-wide workflow issue, not specific to this contribution. It's startup-failing (0 jobs run) on all recent PRs and even on direct pushes to main, so it looks like the workflow file itself can't start rather than anything in this recipe.

On the recipe itself:

  • Markdown Lint passes locally against .github/.markdownlint.jsonc (0 errors). It's currently waiting on the first-time-contributor "Approve and run workflows" before it can execute here.
  • No credentials/secrets are included — .env is gitignored, and a recipe-local .gitignore keeps logs/sync-log/personal URL lists out.

Happy to adjust anything once a maintainer can approve the workflow runs. Thanks!

kae36 and others added 3 commits June 1, 2026 21:58
Use a real-browser User-Agent (Chrome/124) and richer Accept headers to
avoid HTTP 403 bot-blocks on news/lifestyle sites. Skip non-URL lines
(e.g. section markers like "START") with a notice instead of counting
them as fetch failures.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
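
The two changes in this commit (browser-like request headers and skipping non-URL lines) could be sketched as below. The names `parseUrlLines`, `BROWSER_HEADERS`, and `fetchPage` are hypothetical, the exact header values are illustrative rather than taken from the recipe, and the sketch only handles one URL per line (CSV column handling is not shown).

```typescript
// Illustrative headers; the commit only states "Chrome/124" and "richer Accept headers".
export const BROWSER_HEADERS: Record<string, string> = {
  "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
  "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language": "en-US,en;q=0.9",
};

// True only for strings that parse as absolute http(s) URLs.
function isHttpUrl(value: string): boolean {
  try {
    const url = new URL(value);
    return url.protocol === "http:" || url.protocol === "https:";
  } catch {
    return false; // not parseable as an absolute URL, e.g. "START"
  }
}

// Split input into URLs to fetch and lines to skip with a notice.
export function parseUrlLines(text: string): { urls: string[]; skipped: string[] } {
  const urls: string[] = [];
  const skipped: string[] = [];
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // blank lines are ignored silently
    if (isHttpUrl(trimmed)) urls.push(trimmed);
    else skipped.push(trimmed); // reported as a notice, not counted as a fetch failure
  }
  return { urls, skipped };
}

// Fetch one page with the browser-like headers; returns null on a non-2xx status.
export async function fetchPage(url: string): Promise<string | null> {
  const response = await fetch(url, { headers: BROWSER_HEADERS });
  if (!response.ok) return null;
  return await response.text();
}
```

Separating parsing from fetching keeps section markers such as "START" out of the failure count, since they never reach `fetchPage`.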