These are all of the Python scripts I used for scraping. These aren't maintained or meant to be regularly usable; in fact I built this all fairly out of order.
My initial approach was to use Supabase, their built-in pgvector, and Cloudinary. I ended up ripping these out and using Turbopuffer and Cloudflare R2 instead. The reasons were:
- Cloudinary was super expensive! I was paying $100/mo for personal use when I went over their free plan.
- Embeddings were really slow in the old setup. I was using a Supabase edge function that calculated embeddings with their `Supabase/gte-small` model. I could've spun up something on a GPU somewhere to serve these queries, but that would've been overbuilt (I didn't want to build an inference pipeline), so I was restricted to models people already run serverless endpoints for, and none of them serve `Supabase/gte-small`. Together.ai serves `BAAI/bge-base-en-v1.5`, and I'd had good experiences with it before, so I swapped and re-embedded everything with it.
- I had integrated Turbopuffer elsewhere and was really impressed by its performance characteristics and how ergonomic the APIs were. Their founder Simon has given great talks and is clearly really sharp. So I decided to migrate: I added an `uploaded_to_turbopuffer_at` column to the `page` table in Supabase, ran a batch job on my laptop to upload everything, and then removed Supabase entirely.
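The migration step above boils down to a resumable batch loop: grab pages that haven't been stamped, push a batch, then set `uploaded_to_turbopuffer_at` so a crashed run can pick up where it left off. A minimal sketch, using in-memory stand-ins for the Supabase table and the Turbopuffer upsert call (both are assumptions; the real script talked to the live services):

```python
from datetime import datetime, timezone

# Hypothetical stand-in for the Supabase `page` table.
pages = [
    {"id": i, "embedding": [0.0] * 8, "uploaded_to_turbopuffer_at": None}
    for i in range(10)
]

def upsert_batch(batch):
    """Placeholder for the actual Turbopuffer upsert call."""
    pass

def migrate(pages, batch_size=4):
    # Only touch pages that haven't been stamped yet, so the job
    # is safe to re-run after a partial failure.
    todo = [p for p in pages if p["uploaded_to_turbopuffer_at"] is None]
    for i in range(0, len(todo), batch_size):
        batch = todo[i : i + batch_size]
        upsert_batch(batch)
        # Stamp each row only after its batch succeeded.
        now = datetime.now(timezone.utc).isoformat()
        for p in batch:
            p["uploaded_to_turbopuffer_at"] = now

migrate(pages)
```

The stamp-after-upload ordering means the worst failure mode is re-uploading one batch, which an upsert makes harmless.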
A bit of a guide to this directory:
- `scrape-and-download.py` traverses the Whole Earth Index site to download each issue PDF.
- `issue-metadata.py` gets the links / descriptions from the Whole Earth Index site.
- `check-complete.py` would check how many PDFs had been scraped.
- `upload-images.py` would extract each individual page from a PDF, create a page row, and upload it to Cloudinary.
- `percent-complete.py` would check the progress of the page-splitting process.
- `make_pages.py` and `fix-page-numbers.py` would add the total `num_pages` to each `issue` record. The first pass of scraping didn't get everything.
- `generate-embeddings.js` was my initial embedding pass, run locally on CPU with the `Supabase/gte-small` model.
- `generate-bge-embeddings.py` was my second embedding pass, using Together.ai and the `BAAI/bge-base-en-v1.5` model.
- `migrate-page-to-r2.py` would temporarily download each Cloudinary page image and upload it to R2. A nice optimization here would've been to just derive the id (the `r2_object_id` is just `pages/{page_id}`), but I wanted to keep the whole thing in the DB to verify I'd uploaded everything.
- `turbopuffer-page.py` fetches a batch of page records, uploads them to Turbopuffer, and marks them as uploaded in the database.
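Since the `r2_object_id` mentioned above is derived purely from the page id, the convention fits in a one-line helper; a sketch of what that derivation looks like:

```python
def r2_object_id(page_id) -> str:
    # The R2 key is fully determined by the page id, so it could be
    # reconstructed on the fly rather than stored in the database.
    return f"pages/{page_id}"

r2_object_id(42)  # "pages/42"
```

Storing it anyway trades a little redundancy for an easy completeness check: a NULL `r2_object_id` means the upload never happened.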
All of these are 'ephemeral' scripts, not really meant to be run multiple times, and basically all AI generated. Many of them follow 'meh' code practices: e.g., they don't check that required environment variables exist, they make a lot of assumptions about the underlying data, etc. Use at your own risk!