This project lets you recursively spider domains, extract all text from web pages, and save everything as plain-text files suitable for feeding into a RAG pipeline or vector store.
The current version is a PHP web UI that replaces an earlier Python-only CLI script.
- Per-domain projects: each domain is stored in its own folder under `data/`.
- Recursive crawl: follows internal links only, staying on the same host.
- Throttle control: configurable pause between page fetches.
- Start / stop spidering: launch background spiders and request clean stops.
- Clean data: remove data for a single domain or all domains.
- Zip & download: download all text files for a domain as a ZIP archive.
- Metadata export: per-domain `metadata.jsonl` describing each crawled page.
- RAG data profile: UI shows estimated tokens, chunks, corpus size, and primary language.
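
The recursive-crawl rule listed above (follow internal links only, stay on the same host) could be implemented roughly as follows. This is a minimal sketch, not the project's actual code: the function name `isInternalLink` is hypothetical, and treating relative links as internal while skipping non-HTTP schemes is an assumption.

```php
<?php
/**
 * Hypothetical helper: decide whether a link found on a page should be followed.
 * ASSUMPTION: relative links count as internal; absolute links must match the start host.
 */
function isInternalLink(string $startHost, string $href): bool
{
    $href = trim($href);
    // Skip empty links and pure fragments such as "#top".
    if ($href === '' || $href[0] === '#') {
        return false;
    }
    $parts = parse_url($href);
    if ($parts === false) {
        return false; // Seriously malformed URL.
    }
    // Skip non-HTTP schemes such as mailto:.
    if (isset($parts['scheme'])
        && !in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }
    // No host component means a relative URL, which stays on the current site.
    if (!isset($parts['host'])) {
        return true;
    }
    // Host names are case-insensitive.
    return strcasecmp($parts['host'], $startHost) === 0;
}

// Illustrative usage with made-up links.
$startHost = 'example.com';
foreach (['/docs/intro', 'https://example.com/about', 'https://docs.example.com/'] as $href) {
    echo $href, ' => ', isInternalLink($startHost, $href) ? 'follow' : 'skip', "\n";
}
```

Under this sketch a subdomain such as `docs.example.com` counts as a different host from `example.com`, so it would need to be added as its own domain.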
- PHP 8+ with:
  - `curl`
  - `dom`
  - `zip` (ZipArchive)
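
To check these requirements before starting the server, a short script along these lines may help. It is only a sketch: the function name `missingExtensions` is hypothetical and not part of the project.

```php
<?php
/**
 * Hypothetical helper: return the extensions from $required that are not loaded.
 */
function missingExtensions(array $required): array
{
    return array_values(array_filter(
        $required,
        fn (string $ext): bool => !extension_loaded($ext)
    ));
}

// The README asks for PHP 8+ with curl, dom, and zip.
if (version_compare(PHP_VERSION, '8.0.0', '<')) {
    echo 'PHP 8 or newer is required, found ', PHP_VERSION, "\n";
}

$missing = missingExtensions(['curl', 'dom', 'zip']);
echo $missing === []
    ? "All required extensions are loaded.\n"
    : 'Missing extensions: ' . implode(', ', $missing) . "\n";
```

Running `php -m` on the command line lists the loaded extensions as well.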
From the project root:

    php -S localhost:8000 index.php

Then open http://localhost:8000 in your browser.
- Add domains
  - Use the Add Domains panel.
  - Enter one or more domains (e.g. `example.com`, `docs.example.com`) separated by new lines or commas.
  - Choose a pause between requests (seconds) to throttle crawling.
  - Pick data-quality options:
    - Strip common layout elements before extracting text.
    - Export `metadata.jsonl` per domain.
    - Enable simple language detection.
- Start spidering
  - For each domain card, click Start Spider (or Re-run Spider to refresh).
  - A background PHP worker crawls the site, saving one `.txt` file per URL to `data/{domain_key}/`.
- Monitor progress
  - The dashboard auto-updates:
    - Status (READY / RUNNING / COMPLETED / STOPPED / ERROR).
    - Pages crawled.
    - Last URL.
    - Start / finish timestamps.
    - Per-domain RAG profile: pages, chars, tokens, estimated 1k-token chunks, corpus size (KB), primary language.
    - Global stats in the hero: total domains, pages, tokens, chunks, and corpus size.
- Stop / restart
  - Click Stop on a running domain to request a graceful stop.
  - You can later click Start Spider again to run another crawl with the same settings.
- Clean or re-run
  - Clean: remove all spidered data for a domain.
  - Clean All Domains: remove every domain folder under `data/`.
  - Re-run Spider: start another crawl for a domain (e.g. after the site content changes).
- Download for RAG
  - Use Zip & Download to get all `.txt` files for a domain.
  - Use Metadata (.jsonl) to download a JSONL file where each line is a record with the fields `url`, `filename`, `title`, `text`, `crawled_at`, `language`, `char_length`, `approx_tokens`.
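
Once downloaded, the metadata file can be read line by line and its `text` field split into chunks for a RAG pipeline. The sketch below is illustrative only: `parseJsonl` and `chunkText` are hypothetical helpers, the sample record is made up, and the ratio of roughly 4 characters per token is an assumption that may not match how the project computes `approx_tokens`.

```php
<?php
/**
 * Parse JSONL content (one JSON object per line) into an array of records.
 * Blank or malformed lines are skipped.
 */
function parseJsonl(string $content): array
{
    $records = [];
    foreach (preg_split('/\R/', $content) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        $record = json_decode($line, true);
        if (is_array($record)) {
            $records[] = $record;
        }
    }
    return $records;
}

/**
 * Split text into chunks of roughly $maxTokens tokens.
 * ASSUMPTION: about 4 characters per token; the project's own estimate may differ.
 */
function chunkText(string $text, int $maxTokens = 1000, int $charsPerToken = 4): array
{
    $maxChars = max(1, $maxTokens * $charsPerToken);
    // The "u" flag counts UTF-8 characters rather than bytes; "s" lets "." match newlines.
    preg_match_all('/.{1,' . $maxChars . '}/us', $text, $matches);
    return $matches[0] ?? [];
}

// Illustrative usage with a made-up record; real input would come from metadata.jsonl.
$sample = '{"url":"https://example.com/","filename":"index.txt","title":"Home","text":"Hello world"}' . "\n";
foreach (parseJsonl($sample) as $record) {
    foreach (chunkText($record['text'] ?? '') as $i => $chunk) {
        echo $record['url'], ' chunk ', $i, ': ', strlen($chunk), " bytes\n";
    }
}
```

Fixed-size character chunks are the simplest option; splitting on paragraph or sentence boundaries usually gives better retrieval results but needs more logic.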