Web Spider

This project recursively spiders domains, extracts the text from each web page, and saves it as plain-text files suitable for feeding into a retrieval-augmented generation (RAG) pipeline or vector store.

The current version is a PHP web UI that replaces an earlier Python-only CLI script.

Web UI (Control Panel)

Features

Per-domain projects: each domain is stored in its own folder under data/.
Recursive crawl: follows internal links only, staying on the same host.
Throttle control: configurable pause between page fetches.
Start / stop spidering: launch background spiders and request clean stops.
Clean data: remove data for a single domain or all domains.
Zip & download: download all text files for a domain as a ZIP archive.
Metadata export: per-domain metadata.jsonl describing each crawled page.
RAG data profile: UI shows estimated tokens, chunks, corpus size, and primary language.
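
The recursive crawl, throttle, and stop features can be pictured with a small sketch. This is not the project's PHP worker; it is a minimal Python illustration, assuming a breadth-first traversal and a caller-supplied `fetch` function. The names `crawl`, `internal_links`, `pause_seconds`, and `should_stop` are hypothetical.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return absolute links found in html that stay on page_url's host."""
    parser = LinkParser()
    parser.feed(html)
    host = urlsplit(page_url).hostname
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlsplit(absolute)
        if parts.scheme in ("http", "https") and parts.hostname == host:
            links.append(absolute)
    return links


def crawl(start_url, fetch, pause_seconds=1.0, should_stop=lambda: False):
    """Breadth-first crawl. fetch(url) returns HTML text, or None on failure."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and not should_stop():  # a stop request ends the loop cleanly
        url = queue.popleft()
        html = fetch(url)
        if html is not None:
            pages[url] = html
            for link in internal_links(url, html):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        if queue:
            time.sleep(pause_seconds)  # throttle between page fetches
    return pages
```

Passing a dictionary lookup such as `site.get` as `fetch` lets the sketch run without network access, which is convenient for checking the same-host filter before pointing anything at a real site.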

Requirements

PHP 8 or later with these extensions:
- curl
- dom
- zip (ZipArchive)

Running the UI

From the project root:

php -S localhost:8000 index.php

Then open http://localhost:8000 in your browser.

Workflow

Add domains
- Use the Add Domains panel.
- Enter one or more domains (e.g. example.com, docs.example.com) separated by new lines or commas.
- Choose a pause between requests (seconds) to throttle crawling.
- Pick data-quality options:
  - Strip common layout elements before extracting text.
  - Export metadata.jsonl per domain.
  - Enable simple language detection.
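
As an illustration of the input format accepted by the Add Domains panel, the sketch below splits a mixed comma and newline list into individual domains. It is a Python approximation, not the UI's actual code; lowercasing and de-duplicating entries are assumptions, and `parse_domains` is a hypothetical name.

```python
import re


def parse_domains(raw):
    """Split on commas and line breaks, trim, and drop blanks and repeats."""
    seen = set()
    domains = []
    for piece in re.split(r"[,\r\n]+", raw):
        domain = piece.strip().lower()  # assumption: domains compare case-insensitively
        if domain and domain not in seen:
            seen.add(domain)
            domains.append(domain)
    return domains
```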
Start spidering
- For each domain card, click Start Spider (or Re-run Spider to refresh).
- A background PHP worker crawls the site, saving one .txt file per URL to data/{domain_key}/.
Monitor progress
- The dashboard auto-updates:
  - Status (READY / RUNNING / COMPLETED / STOPPED / ERROR).
  - Pages crawled.
  - Last URL.
  - Start / finish timestamps.
  - Per-domain RAG profile: pages, characters, tokens, estimated 1k-token chunks, corpus size (KB), primary language.
  - Global stats in the hero: total domains, pages, tokens, chunks, and corpus size.
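
The RAG profile figures are estimates. How the UI computes them is not stated here, so the following Python sketch is only one plausible approach: it assumes roughly four characters per token (a common rule of thumb for English text) and 1,000 tokens per chunk. `rag_profile` and both constants are hypothetical names.

```python
import math

CHARS_PER_TOKEN = 4       # assumption: rough heuristic, not the project's formula
TOKENS_PER_CHUNK = 1000   # mirrors the "1k-token chunks" shown in the profile


def rag_profile(texts):
    """Estimate corpus statistics for a list of page texts."""
    chars = sum(len(text) for text in texts)
    size_bytes = sum(len(text.encode("utf-8")) for text in texts)
    tokens = math.ceil(chars / CHARS_PER_TOKEN)
    chunks = math.ceil(tokens / TOKENS_PER_CHUNK)
    return {
        "pages": len(texts),
        "chars": chars,
        "tokens": tokens,
        "chunks": chunks,
        "size_kb": round(size_bytes / 1024, 1),
    }
```

A real tokenizer will give different counts, especially for non-English text, so numbers like these are best used for rough sizing only.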
Stop / restart
- Click Stop on a running domain to request a graceful stop.
- You can later click Start Spider again to run another crawl with the same settings.
Clean or re-run
- Clean: remove all spidered data for a domain.
- Clean All Domains: remove every domain folder under data/.
- Re-run Spider: start another crawl for a domain (e.g. after the site content changes).
Download for RAG
- Use Zip & Download to get all .txt files for a domain.
- Use Metadata (.jsonl) to download a JSONL file where each line is a record:
  - url, filename, title, text, crawled_at, language, char_length, approx_tokens.
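
To show how the metadata export might be consumed downstream, here is a minimal Python sketch that reads the JSONL file one record per line and shapes the records for an indexing step. The field names come from the list above; the output layout (`id`, `text`, `meta`), the `min_chars` filter, and the function names are assumptions for illustration.

```python
import json


def read_metadata(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def to_documents(records, min_chars=1):
    """Keep records with enough text and reshape them for indexing."""
    documents = []
    for record in records:
        text = record.get("text", "")
        if len(text) >= min_chars:
            documents.append({
                "id": record.get("url"),
                "text": text,
                "meta": {
                    "title": record.get("title"),
                    "language": record.get("language"),
                    "crawled_at": record.get("crawled_at"),
                },
            })
    return documents
```

Calling `to_documents(read_metadata("metadata.jsonl"), min_chars=200)` would, under these assumptions, skip near-empty pages before they reach a vector store.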
