Crawl a site, test internal navigation, and review redirects, parameter handling, soft failures, and URL-pattern issues in one place.
Cat Crawler is a React frontend plus Node.js backend for website crawl and navigation validation.
It is designed for:
- internal site QA
- launch and regression checks
- redirect and routing reviews
- large-site spot checks where a manual click-through would miss issues
The main UI starts a background crawl job, polls for progress, and then renders the result as a grouped report.
There is also an optional bookmarklet. The bookmarklet does not perform the crawl itself. It opens the deployed Cat Crawler app in a floating panel and passes the current page URL into the app as the starting URL.
Public docs and installer:
- GitHub Pages: https://carlashub.github.io/site-crawler/
- Repository: https://github.com/CarlasHub/site-crawler
- Checking internal links and form actions from a real starting URL
- Verifying redirect behaviour, including multi-hop chains and dropped query parameters
- Reviewing query-driven routes with parameter audit enabled
- Finding pages that return `200` but still look broken because content or API requests failed
- Spotting duplicate-looking URL structures and inconsistent naming
- Saving repeatable crawl presets for the same client or site area
- Sitemap-first discovery from `robots.txt` sitemap entries or default `sitemap.xml`
- `robots.txt` enforcement before crawling
- Same-host crawling with optional scope limited to the start path
- Excluded-path rules and language-agnostic per-path crawl limits
- Optional job-page suppression for noisy recruitment sections
- Optional broken-link quick check with HTTP status recording
- Optional parameter audit for query-driven routes
- Redirect audit with chain details, loops, multi-hop chains, parameter loss, and irrelevant destinations
- Soft-failure audit for successful pages that still fail functionally
- URL pattern audit for duplicate structures, legacy/current path pairs, and inconsistent naming
- Impact audit to help prioritise repeated or core-flow issues
- TXT and CSV export from the rendered audit report
- Preset save, export, and import in the browser
- Optional bookmarklet runner for opening the app from the page you are already viewing
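
As a rough illustration of the sitemap-first discovery listed above, the sketch below pulls `Sitemap:` entries out of a `robots.txt` body and falls back to the default `sitemap.xml` location. The function name `sitemapUrlsFromRobots` is hypothetical and this is not the crawler's actual implementation.

```javascript
// Hypothetical helper (not taken from the Cat Crawler source): collect the
// sitemap URLs declared in a robots.txt body, or fall back to /sitemap.xml.
function sitemapUrlsFromRobots(robotsTxt, siteOrigin) {
  const found = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    // The directive name is matched case-insensitively.
    const match = /^sitemap:\s*(\S+)/i.exec(rawLine.trim());
    if (!match) continue;
    try {
      found.push(new URL(match[1], siteOrigin).href);
    } catch {
      // Skip entries that cannot be parsed as a URL.
    }
  }
  // No declared sitemap: try the conventional default location.
  return found.length > 0 ? found : [new URL("/sitemap.xml", siteOrigin).href];
}
```

A crawler built this way would fetch each returned URL and seed its queue from the entries it finds there.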
If GitHub does not render the video inline in your browser, open the demo video directly.
The bookmarklet lives in docs/bookmarklet.js.
Its behaviour is:
- open the full Cat Crawler panel immediately (a title bar, Hide to collapse to a small control, Show to restore)
- load the deployed app in an iframe inside that panel (Express static build ships no `X-Frame-Options`; third-party `frame-ancestors` on your host is outside this repo)
- pass the current page URL as the `url` query parameter (`?mode=bookmarklet&url=...`)
- show a loading state while the iframe loads and an in-panel error if it times out
- reuse one instance: running the bookmarklet again focuses the same root; if you navigate and run it again, the iframe reloads with the new page URL
- Close tears the UI down completely
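
The URL handoff described above can be sketched as follows. The helper name `buildBookmarkletAppUrl` and the assumption that the app is served from the root path are illustrative only; the real logic lives in docs/bookmarklet.js.

```javascript
// Hypothetical sketch: build the iframe URL that carries the current page
// into the app. Assumes the app is served from "/" on its origin.
function buildBookmarkletAppUrl(appOrigin, pageUrl) {
  const target = new URL("/", appOrigin);
  target.searchParams.set("mode", "bookmarklet");
  // searchParams.set percent-encodes the page URL, so its own query string
  // survives as a single parameter value.
  target.searchParams.set("url", pageUrl);
  return target.href;
}
```

In a browser the second argument would typically be `window.location.href`.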
The public docs site builds the install link from docs/config.js and docs/install.js. The committed docs/config.js sets appOrigin to the current production app origin: https://site-crawler-989268314020.europe-west2.run.app. For local testing only, run APP_ENV=local node scripts/write-public-config.mjs (never commit that output). scripts/validate-committed-docs-config.mjs rejects loopback or non-HTTPS origins in the tracked file.
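
For illustration, the kind of check the validation script is described as performing could look like the sketch below. The function name `isPublishableAppOrigin` is hypothetical, and the actual rules in scripts/validate-committed-docs-config.mjs may differ.

```javascript
// Hypothetical sketch of an origin check that rejects loopback or non-HTTPS
// values. Not copied from the repository's validation script.
function isPublishableAppOrigin(origin) {
  let parsed;
  try {
    parsed = new URL(origin);
  } catch {
    return false; // not a parseable URL at all
  }
  if (parsed.protocol !== "https:") return false;
  const host = parsed.hostname.toLowerCase();
  const isLoopback =
    host === "localhost" ||
    host.endsWith(".localhost") ||
    host === "[::1]" ||
    /^127(\.\d{1,3}){3}$/.test(host);
  return !isLoopback;
}
```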
High-level flow:
- The frontend submits a crawl job to `POST /api/crawl/start`.
- The backend validates the request and creates a background crawl job.
- The frontend polls `GET /api/crawl/:jobId` for progress and final results.
- The frontend renders grouped audit sections and offers TXT or CSV export.
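
The start-then-poll flow above can be sketched as a small client. Only the two endpoint paths come from this README; the request body shape, the `jobId` response field, and the `status` values `"completed"` and `"failed"` are assumptions made for the example.

```javascript
// Minimal sketch of a start-and-poll client. Field names in the request and
// response bodies are assumed, not taken from the backend source.
async function runCrawl(baseUrl, settings, { fetchImpl = fetch, intervalMs = 2000 } = {}) {
  const startRes = await fetchImpl(new URL("/api/crawl/start", baseUrl), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(settings),
  });
  if (!startRes.ok) throw new Error(`Crawl start failed with HTTP ${startRes.status}`);
  const { jobId } = await startRes.json(); // assumed response field

  for (;;) {
    const pollRes = await fetchImpl(new URL(`/api/crawl/${encodeURIComponent(jobId)}`, baseUrl));
    if (!pollRes.ok) throw new Error(`Crawl poll failed with HTTP ${pollRes.status}`);
    const job = await pollRes.json();
    if (job.status === "completed") return job; // assumed terminal state
    if (job.status === "failed") throw new Error("Crawl job failed"); // assumed terminal state
    // Wait before asking again so the backend is not hammered.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Passing `fetchImpl` explicitly keeps the sketch testable without a running backend; in normal use the global `fetch` in Node 22 or the browser is enough.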
Runtime components:
- `frontend/`: React application served as a built Vite SPA
- `backend/`: Express API and crawl engine
- `docs/`: GitHub Pages docs site and bookmarklet loader
Current background-job model:
- local development defaults to file-backed job state
- staging and production must use Firestore-backed job state
- the UI depends on the background-job endpoints for normal use
flowchart LR
U[User Browser] --> D[GitHub Pages Docs]
U --> A[Cat Crawler App]
D --> B[Bookmarklet Loader]
B --> A
A --> F[React Frontend]
A --> E[Express Backend]
E --> J[Background Crawl Jobs]
E --> S[(Firestore in staging/production)]
Current supported runtime contract:
- Node.js `22.x`
Important operational constraints:
- Staging and production must use `JOB_STATE_BACKEND=firestore`
- All production instances must share the same Firestore backend and collection prefix
- Crawl jobs are rate-limited and capped by backend hard limits
- Active crawl jobs are hard-capped to `2` (`CRAWL_MAX_ACTIVE_JOBS`)
- Frontend controls are aligned to the backend caps: `maxPages` up to `300`, `concurrency` up to `6`
- The crawler is for public `http(s)` targets only; internal, loopback, link-local, and metadata destinations are blocked
- Crawling stays on the same host as the start URL
- The bookmarklet is only a launcher for the app; it does not replace the backend
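
To make two of the constraints above concrete, here is a hedged sketch of cap clamping and of a literal-address safety check. The names `clampCrawlSettings` and `isObviouslyBlockedTarget` are hypothetical. The check only inspects the URL itself; a real outbound guard would also need to verify the addresses a hostname resolves to, which is not shown here.

```javascript
// Caps taken from this README; the clamping behaviour itself is an assumption.
const HARD_CAPS = { maxPages: 300, concurrency: 6 };

function clampCrawlSettings(settings) {
  const clamp = (value, max) => {
    const n = Math.trunc(Number(value));
    if (!Number.isFinite(n) || n < 1) return 1; // assumed floor of 1
    return Math.min(n, max);
  };
  return {
    ...settings,
    maxPages: clamp(settings.maxPages, HARD_CAPS.maxPages),
    concurrency: clamp(settings.concurrency, HARD_CAPS.concurrency),
  };
}

// Hypothetical pre-flight check on the URL text only (no DNS resolution).
function isObviouslyBlockedTarget(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return true;
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return true;
  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return true;
  // IPv6 loopback, plus a simplified prefix test for link-local and unique-local.
  if (host === "[::1]" || host.startsWith("[fe80:") || host.startsWith("[fc") || host.startsWith("[fd")) return true;
  const v4 = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!v4) return false;
  const a = Number(v4[1]);
  const b = Number(v4[2]);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254) // link-local, includes the 169.254.169.254 metadata address
  );
}
```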
- Open a local run or your deployed Cat Crawler app.
- Enter a homepage URL such as `https://example.com`.
- Add any exclude paths you do not want crawled.
- Add optional path limits for noisy sections.
- Choose whether to enable broken-link checking or parameter audit.
- Run the crawl.
- Review the grouped results and export TXT or CSV if needed.
- Open the public docs site: https://carlashub.github.io/site-crawler/
- Drag the bookmarklet button to your bookmarks bar.
- Open the page you want to seed from.
- Click the bookmarklet to open the full panel with Cat Crawler loaded for the current tab’s URL.
- Exclude paths: Prevents crawling whole sections such as `/jobs` or `/careers`.
- Crawl limits by path: Caps how many pages are crawled under a path such as `/job`.
- Max pages: Total crawl size cap. The UI and backend both cap this at `300`.
- Concurrency: How many crawl workers run at once. The UI and backend both cap this at `6`.
- Include querystrings: Keeps querystring variants in the crawl scope when appropriate.
- Ignore job pages: Suppresses job-heavy pages by default.
- Broken link quick check: Adds live HTTP status checking for discovered navigation targets.
- Parameter audit: Tests how the site handles parameter variants such as `?page=2` or `?filter=value`.
- URL match filter: Filters the rendered results after the crawl is complete.
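
One way exclude paths and per-path crawl limits could interact is sketched below. `createScopeFilter` is a hypothetical name, and matching on whole path segments (so `/job` does not match `/jobs`) is an assumption; the backend may match paths differently.

```javascript
// Hypothetical scope filter: exclude rules win first, then per-path counters
// decide whether another page under a limited prefix may be crawled.
function createScopeFilter({ excludePaths = [], pathLimits = {} } = {}) {
  const counts = new Map();
  // Assumed rule: a prefix matches itself or anything below it as a segment.
  const isUnder = (pathname, prefix) =>
    pathname === prefix || pathname.startsWith(prefix.endsWith("/") ? prefix : `${prefix}/`);

  return function shouldCrawl(rawUrl) {
    const { pathname } = new URL(rawUrl);
    if (excludePaths.some((prefix) => isUnder(pathname, prefix))) return false;

    const matching = Object.entries(pathLimits).filter(([prefix]) => isUnder(pathname, prefix));
    // Check every matching limit before counting, so a rejection never consumes budget.
    if (matching.some(([prefix, limit]) => (counts.get(prefix) ?? 0) >= limit)) return false;
    for (const [prefix] of matching) counts.set(prefix, (counts.get(prefix) ?? 0) + 1);
    return true;
  };
}
```

A call such as `createScopeFilter({ excludePaths: ["/careers"], pathLimits: { "/job": 2 } })` returns a function that accepts the first two URLs under `/job/` and rejects later ones.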
- Audit report: The main rendered list of crawled navigation entries with source, referrer, status, and classification.
- Validation report: Summary view of broken URLs, redirect issues, parameter-handling issues, soft failures, and impact issues.
- Redirect audit: Focused view of redirected navigation, including loops, multiple hops, lost params, and irrelevant destinations.
- Parameter audit: Focused view of parameterised URLs and whether parameters were preserved, dropped, or redirected unexpectedly.
- Soft failures: Pages that returned success but still appear broken because content or API behaviour failed.
- URL patterns: Structural grouping for duplicate patterns, legacy/current paths, and inconsistent naming.
- Issue impact: Prioritisation layer for repeated or core-flow issues.
- Duplicate content candidates: Quick grouping of URL variants that may represent duplicate content.
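
As an illustration of what the redirect audit looks for, the sketch below classifies a recorded redirect chain for loops, multiple hops, and lost query parameters. The function name `classifyRedirectChain` and the returned field names are hypothetical and do not reflect the report's actual data model.

```javascript
// Hypothetical classifier. `chain` is a non-empty ordered list of absolute
// URLs: the requested URL first, the final destination last.
function classifyRedirectChain(chain) {
  const hops = chain.length - 1;

  // A URL that appears twice in the chain indicates a loop.
  const seen = new Set();
  let loop = false;
  for (const url of chain) {
    if (seen.has(url)) loop = true;
    seen.add(url);
  }

  // Parameters present on the first URL but missing from the last one.
  const startParams = new URL(chain[0]).searchParams;
  const finalParams = new URL(chain[chain.length - 1]).searchParams;
  const lostParams = [...new Set(startParams.keys())].filter((key) => !finalParams.has(key));

  return { hops, multiHop: hops > 1, loop, lostParams };
}
```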
- Cat Crawler does not claim to replace human QA judgement.
- It only crawls one host per run.
- It only crawls public `http(s)` targets that pass the outbound safety checks.
- Soft-failure detection is heuristic by design. Treat it as review input, not an absolute verdict.
- Pattern and impact analysis help prioritise review. They do not replace manual interpretation.
- The GitHub Pages docs site is static. It must be configured with the correct `BOOKMARKLET_APP_ORIGIN` for a real deployed app before publishing.
- Node.js `22.x`
- npm
- A reachable public website to crawl for testing
- Install dependencies:

      cd frontend
      npm ci
      cd ../backend
      npm ci

- Build the frontend:

      cd ../frontend
      npm run build

- Start the backend:

      cd ../backend
      npm start

- Open the app:
  - App UI: http://localhost:8080
  - Health check: http://localhost:8080/healthz
- Optional: regenerate the local bookmarklet docs config explicitly from the repo root:

      APP_ENV=local node scripts/write-public-config.mjs

The local default app origin is http://localhost:8080.
- Use Node `22.x`
- Run `npm ci` in `frontend/` and `backend/`
- Build the frontend before starting the backend
- Confirm `http://localhost:8080/healthz` responds
- Use `APP_ENV=local` when generating local docs config
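
The health-check step in the list above can be scripted. This is a minimal sketch; the helper name `checkHealth` is hypothetical, and it assumes only that `/healthz` answers with a 2xx status when the app is up.

```javascript
// Hypothetical helper: resolves to true when /healthz answers with a 2xx status.
async function checkHealth(baseUrl, fetchImpl = fetch) {
  try {
    const res = await fetchImpl(new URL("/healthz", baseUrl));
    return res.ok;
  } catch {
    // Connection refused, DNS failure, and similar errors count as unhealthy.
    return false;
  }
}
```

For a local run, `checkHealth("http://localhost:8080")` should resolve to `true` once the backend has started.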
Build and run the production-style container locally from the repo root:

    docker build -t cat-crawler .
    docker run --rm -p 8080:8080 cat-crawler

- GitHub Pages serves the static docs and bookmarklet installer only
- The actual crawler app is the Node container built from `Dockerfile`
- The app serves the built frontend and the backend API on the same origin
- Staging and production require shared Firestore-backed job state

    docker build -t cat-crawler .

The Docker build:
- builds the frontend in a dedicated stage
- installs production backend dependencies in a separate stage
- copies only runtime files into the final image
This project is designed to run on a container host that can run the repository Dockerfile.
Required production contract:
- expose the app on port `8080`
- run the container from this repository `Dockerfile`
- set `JOB_STATE_BACKEND=firestore`
- provide Firestore credentials securely
- publish the app on HTTPS
- point `BOOKMARKLET_APP_ORIGIN` at that final HTTPS app URL when generating docs
- Build and publish the container image
- Set `APP_ENV=staging` or `APP_ENV=production`
- Set `JOB_STATE_BACKEND=firestore`
- Point every instance at the same Firestore backend and collection prefix
- Set `TRUST_PROXY` intentionally for the real ingress path
- Publish the app on HTTPS
- Regenerate `docs/config.js` for the final public app origin
- Re-publish the GitHub Pages docs after `docs/config.js` is updated
- Confirm `/healthz` responds from the deployed app
Key production environment variables come from `.env.example`:
- `APP_ENV`
- `PORT`
- `BOOKMARKLET_APP_ORIGIN`
- `TRUST_PROXY`
- `JOB_STATE_BACKEND`
- `FIRESTORE_CRAWL_JOBS_COLLECTION`
- `CRAWL_MAX_ACTIVE_JOBS`
- rate limit and crawl cap variables as needed
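
A startup check for the contract above might look like the following sketch. The function name `validateRuntimeEnv`, the default of `local` when `APP_ENV` is unset, and the exact messages are assumptions; only the variable names and the Firestore and HTTPS requirements come from this README.

```javascript
// Hypothetical startup validation. Returns a list of problems; an empty list
// means the environment satisfies the rules sketched here.
function validateRuntimeEnv(env) {
  const problems = [];
  const appEnv = env.APP_ENV ?? "local"; // assumed default
  const hosted = appEnv === "staging" || appEnv === "production";

  if (hosted && env.JOB_STATE_BACKEND !== "firestore") {
    problems.push(`APP_ENV=${appEnv} requires JOB_STATE_BACKEND=firestore`);
  }
  if (hosted && env.BOOKMARKLET_APP_ORIGIN && !env.BOOKMARKLET_APP_ORIGIN.startsWith("https://")) {
    problems.push("BOOKMARKLET_APP_ORIGIN must be an HTTPS origin");
  }
  return problems;
}
```

In a Node process this would be called as `validateRuntimeEnv(process.env)` before the server starts listening.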
Example docs config generation:

    APP_ENV=production BOOKMARKLET_APP_ORIGIN=https://site-crawler-989268314020.europe-west2.run.app node scripts/write-public-config.mjs

Smarter crawl control:
- Add saved crawl histories with rerun from previous settings.
- Add advanced include and exclude rules with testable pattern previews.
- Add per-section crawl summaries so large sites are easier to review at a glance.
Deeper issue analysis:
- Add clearer issue severity scoring with stronger explanations for why an item matters.
- Add issue deduplication across related URLs so repeated findings are easier to triage.
- Add richer page context for failures, including page title, template clues, and stronger source grouping.
Team workflow improvements:
- Add shareable report views for handoff without exporting raw files first.
- Add comparison mode between two crawls to spot regressions after a release.
- Add more preset tooling for client packs, reusable defaults, and faster setup.
MIT
