[recipes] URL Batch Import #339
Fetch a list of URLs (news, blog posts, web pages), summarize each with an LLM via OpenRouter, and import them as searchable thoughts in Open Brain with SHA-256 content-fingerprint dedup.

- Incremental sync-log persistence so an interrupted run doesn't lose dedup progress (writes after every URL, not just at the end)
- URL-stable fingerprint: hashes a normalized URL (strips utm_* and other tracking params, www, trailing slash, fragment) for reliable cross-run deduplication instead of volatile date+summary content
- backfill-compute.ts helper that reuses the importer's exact fingerprint function to migrate existing rows
- import.meta.main guard so the module is importable for testing/reuse
- sample-urls.csv and test-urls.txt examples; recipe-local .gitignore keeps .env, *.log, sync-log.json, and personal *.csv lists out of git

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
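For readers who want to see what a URL-stable fingerprint can look like, here is a minimal sketch of the idea described in the commit message. It is not the recipe's actual code: the function names `normalizeUrl` and `urlFingerprint` and the contents of `TRACKING_PARAMS` are assumptions made for illustration, and the real importer may normalize more or fewer things.

```typescript
import { createHash } from "node:crypto";

// Hypothetical set of non-utm tracking parameters; the recipe's real list may differ.
const TRACKING_PARAMS = new Set(["fbclid", "gclid"]);

// Normalize a URL so the same article yields the same string across runs.
function normalizeUrl(raw: string): string {
  const url = new URL(raw.trim());
  url.hash = ""; // drop the fragment
  url.hostname = url.hostname.replace(/^www\./, ""); // drop a leading "www."
  // Copy the keys first so that deleting does not disturb the iteration.
  for (const key of [...url.searchParams.keys()]) {
    const lower = key.toLowerCase();
    if (lower.startsWith("utm_") || TRACKING_PARAMS.has(lower)) {
      url.searchParams.delete(key);
    }
  }
  url.pathname = url.pathname.replace(/\/+$/, ""); // drop trailing slashes
  return url.toString();
}

// SHA-256 hex digest of the normalized URL.
function urlFingerprint(raw: string): string {
  return createHash("sha256").update(normalizeUrl(raw)).digest("hex");
}
```

With this sketch, `https://www.example.com/post/?utm_source=x&id=7#top` and `https://example.com/post?id=7` produce the same fingerprint, because the hash covers only the normalized URL and not the fetched content.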
Hey @kae36, welcome to Open Brain Source! 👋 Thanks for submitting your first PR. The automated review will run shortly and check things like metadata, folder structure, and README completeness. If anything needs fixing, the review comment will tell you exactly what. Once the automated checks pass, a human admin will review for quality and clarity. Expect a response within a few days. If you have questions, check out CONTRIBUTING.md or open an issue.
Surround fenced code blocks with blank lines so the repo's Markdown Lint check (markdownlint-cli2, .github/.markdownlint.jsonc) passes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Heads-up for reviewers 👋 — the On the recipe itself:
Happy to adjust anything once a maintainer can approve the workflow runs. Thanks!
Use a real-browser User-Agent (Chrome/124) and richer Accept headers to avoid HTTP 403 bot-blocks on news/lifestyle sites. Skip non-URL lines (e.g. section markers like "START") with a notice instead of counting them as fetch failures. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
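The two behaviours in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the recipe's code: `BROWSER_HEADERS`, `parseUrlList`, `isHttpUrl`, and `fetchPage` are hypothetical names, and the exact header values the importer sends are not shown in this PR.

```typescript
// Browser-like request headers (assumed values; the recipe's exact strings may differ).
const BROWSER_HEADERS: Record<string, string> = {
  "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
  "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language": "en-US,en;q=0.9",
};

interface ParsedList {
  urls: string[];    // lines that parse as http(s) URLs
  skipped: string[]; // non-URL lines, reported as notices rather than failures
}

function isHttpUrl(line: string): boolean {
  try {
    const u = new URL(line);
    return u.protocol === "http:" || u.protocol === "https:";
  } catch {
    return false; // new URL() throws on strings like "START"
  }
}

// Split the input file's text into importable URLs and skipped lines.
function parseUrlList(text: string): ParsedList {
  const urls: string[] = [];
  const skipped: string[] = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue; // ignore blank lines entirely
    if (isHttpUrl(line)) urls.push(line);
    else skipped.push(line);
  }
  return { urls, skipped };
}

// Fetch one page with the browser-like headers (uses the global fetch in Node 18+).
async function fetchPage(url: string): Promise<string> {
  const res = await fetch(url, { headers: BROWSER_HEADERS });
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.text();
}
```

Keeping skipped lines in their own array lets the caller print a notice for each one without inflating the fetch-failure count.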
Contribution Type
/recipes)

What does this do?
Adds a `url-batch-import` recipe: read a list of URLs (`.txt` or `.csv`), fetch and extract each page, generate an LLM summary + metadata via OpenRouter, and import each as a searchable thought in Open Brain with SHA-256 content-fingerprint dedup. One article = one thought; the full extracted text is kept in `metadata.raw_text`.

Requirements
- `content-fingerprint-dedup` primitive for the DB-level dedup column (linked in the README; the script degrades gracefully without it)

Checklist
- `README.md` with prerequisites, step-by-step instructions, and expected outcome
- `metadata.json` has all required fields

Notes
Built for resilient, repeatable imports:
- `utm_*`/`fbclid`/etc., `www`, trailing slash, fragment) so the same article dedups reliably across runs, instead of hashing volatile date+summary content.
- `backfill-compute.ts` helper to migrate existing rows to the URL-based fingerprint, reusing the importer's exact function.
- `failures.log`.

🤖 Generated with Claude Code
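As a rough illustration of the resilience goal, the incremental sync-log persistence from the first commit (write after every URL, not just at the end) might be sketched like this. Everything here is an assumption for illustration: the `SyncLog` shape, the function names, and the write-to-temp-then-rename step are not taken from the recipe.

```typescript
import { existsSync, readFileSync, renameSync, writeFileSync } from "node:fs";

// Hypothetical log shape: fingerprint -> ISO timestamp of the successful import.
type SyncLog = Record<string, string>;

function loadSyncLog(path: string): SyncLog {
  if (!existsSync(path)) return {}; // first run: nothing imported yet
  return JSON.parse(readFileSync(path, "utf8")) as SyncLog;
}

// Write to a temp file, then rename, so a crash mid-write cannot truncate the log.
function saveSyncLog(path: string, log: SyncLog): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(log, null, 2));
  renameSync(tmp, path);
}

// Import each item once, persisting dedup progress after every URL.
async function importAll(
  items: { url: string; fingerprint: string }[],
  logPath: string,
  importOne: (url: string) => Promise<void>, // caller-supplied import step
): Promise<number> {
  const log = loadSyncLog(logPath);
  let imported = 0;
  for (const item of items) {
    if (log[item.fingerprint]) continue; // already imported in an earlier run
    await importOne(item.url);
    log[item.fingerprint] = new Date().toISOString();
    saveSyncLog(logPath, log); // persist now, not at the end of the batch
    imported++;
  }
  return imported;
}
```

If the process is interrupted partway through a batch, the next run reloads the log and skips every URL that had already been recorded, so only the unfinished remainder is fetched again.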