chris-spencer/web-spider

Web Spider

This project lets you recursively spider domains, extract all text from web pages, and save everything as plain-text files suitable for feeding into a RAG pipeline or vector store.

The current version is a PHP web UI that replaces an earlier Python-only CLI script.

Web UI (Control Panel)

Features

  • Per-domain projects: each domain is stored in its own folder under data/.
  • Recursive crawl: follows internal links only, staying on the same host.
  • Throttle control: configurable pause between page fetches.
  • Start / stop spidering: launch background spiders and request clean stops.
  • Clean data: remove data for a single domain or all domains.
  • Zip & download: download all text files for a domain as a ZIP archive.
  • Metadata export: per-domain metadata.jsonl describing each crawled page.
  • RAG data profile: UI shows estimated tokens, chunks, corpus size, and primary language.
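The "internal links only, same host" rule from the feature list can be illustrated with a short sketch. It is written in Python for brevity and is not the project's PHP code; the function name `internal_links` and the choice to drop URL fragments and duplicates are assumptions made for this example.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def internal_links(page_url, hrefs):
    """Return absolute, fragment-free links that stay on page_url's host.

    Hypothetical helper: the real spider's filtering rules may differ.
    """
    host = urlparse(page_url).hostname
    kept = []
    for href in hrefs:
        # Resolve relative links against the page, then drop any "#fragment".
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        same_host = parsed.scheme in ("http", "https") and parsed.hostname == host
        if same_host and absolute not in kept:
            kept.append(absolute)
    return kept


links = internal_links(
    "https://example.com/docs/",
    ["intro.html", "/about#team", "https://other.org/", "mailto:hi@example.com", "/about"],
)
print(links)
```

Under this rule a subdomain such as docs.example.com counts as a different host, which is consistent with adding it as a separate domain in the UI.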

Requirements

  • PHP 8+ with:
    • curl
    • dom
    • zip (ZipArchive)

Running the UI

From the project root:

php -S localhost:8000 index.php

Then open http://localhost:8000 in your browser.

Workflow

  1. Add domains
    • Use the Add Domains panel.
    • Enter one or more domains (e.g. example.com, docs.example.com) separated by new lines or commas.
    • Choose a pause between requests (seconds) to throttle crawling.
    • Pick data-quality options:
      • Strip common layout elements before extracting text.
      • Export metadata.jsonl per domain.
      • Enable simple language detection.
  2. Start spidering
    • For each domain card, click Start Spider (or Re-run Spider to refresh).
    • A background PHP worker crawls the site, saving one .txt file per URL to data/{domain_key}/.
  3. Monitor progress
    • The dashboard auto-updates:
      • Status (READY / RUNNING / COMPLETED / STOPPED / ERROR).
      • Pages crawled.
      • Last URL.
      • Start / finish timestamps.
      • Per-domain RAG profile: pages, chars, tokens, estimated 1k-token chunks, corpus size (KB), primary language.
      • Global stats in the hero: total domains, pages, tokens, chunks, and corpus size.
  4. Stop / restart
    • Click Stop on a running domain to request a graceful stop.
    • You can later click Start Spider again to run another crawl with the same settings.
  5. Clean or re-run
    • Clean: remove all spidered data for a domain.
    • Clean All Domains: remove every domain folder under data/.
    • Re-run Spider: start another crawl for a domain (e.g. after the site content changes).
  6. Download for RAG
    • Use Zip & Download to get all .txt files for a domain.
    • Use Metadata (.jsonl) to download a JSONL file with one JSON record per line. Each record has the fields:
      • url, filename, title, text, crawled_at, language, char_length, approx_tokens.
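Once downloaded, the metadata file can be summarized before it goes into a RAG pipeline. The sketch below is a minimal Python example that assumes `char_length` and `approx_tokens` are numbers and `language` is a short code; the `profile` function, the sample records, and the round-up rule for 1k-token chunks are assumptions for illustration, not the UI's documented calculation.

```python
import io
import json
import math


def profile(lines, chunk_tokens=1000):
    """Aggregate JSONL metadata records into a small corpus profile.

    Hypothetical helper; field types are assumed, not confirmed by the project.
    """
    pages = chars = tokens = 0
    languages = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        pages += 1
        chars += record["char_length"]
        tokens += record["approx_tokens"]
        lang = record.get("language") or "unknown"
        languages[lang] = languages.get(lang, 0) + 1
    return {
        "pages": pages,
        "chars": chars,
        "tokens": tokens,
        # Assumed rounding: a partially filled chunk still counts as a chunk.
        "chunks": math.ceil(tokens / chunk_tokens),
        "primary_language": max(languages, key=languages.get) if languages else None,
    }


# Made-up sample records standing in for a downloaded metadata.jsonl.
sample = io.StringIO(
    '{"url": "https://example.com/", "filename": "index.txt", "title": "Home", '
    '"text": "...", "crawled_at": "2024-01-01T00:00:00Z", "language": "en", '
    '"char_length": 4000, "approx_tokens": 1000}\n'
    '{"url": "https://example.com/about", "filename": "about.txt", "title": "About", '
    '"text": "...", "crawled_at": "2024-01-01T00:00:05Z", "language": "en", '
    '"char_length": 2400, "approx_tokens": 600}\n'
)
summary = profile(sample)
print(summary)
```

To run it on a real export, pass an open file handle for the downloaded metadata.jsonl in place of `sample`.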
