Commit 070e940

Refresh crawler documentation
Author: Goncalves, Carla
1 parent 418cb04 commit 070e940

3 files changed

Lines changed: 53 additions & 24 deletions

README.md

Lines changed: 35 additions & 10 deletions
@@ -1,6 +1,6 @@
 # Cat Crawler
 
-> Discover internal URLs with exclusions, redirects, duplicates, presets, and language-agnostic path limits.
+> Crawl and validate websites for broken links, redirects, parameter handling, soft failures, URL patterns, and impact.
 
 ![UI screenshot](docs/screenshot.png)
 
@@ -30,13 +30,20 @@
 - Sitemap-first discovery when sitemap.xml is available
 - robots.txt respected before fetching any URL
 - Same-host crawling with optional scope limited to the start path
+- Navigation audit across both anchor links and form actions
 - Exclude paths using relative paths, one per line
 - Language-agnostic crawl limits by path (for example `/job` also matches `/en/job`, `/fr/job`)
 - Ignore job pages by default to prevent job-heavy sections from flooding results
-- Redirect resolution with original URL and final URL stored
+- Redirect resolution with full redirect chains, per-step status codes, and final URL stored
 - Optional broken link quick check with HTTP status recording
+- Parameter audit for `?test=1`, `?page=2`, and `?filter=value`
+- Soft-failure detection for successful pages with missing content, error text, or failed API/XHR endpoints
+- URL pattern audit for duplicate structures, legacy/current paths, and inconsistent naming
+- Impact analysis for broken and redirected URLs based on repetition, referrers, and core-flow heuristics
+- Consolidated validation report with broken URLs, redirect issues, parameter issues, soft failures, and impact analysis
 - Duplicate content candidates detection, including querystring and language variants
 - Client presets saved in localStorage, export and import presets as JSON
+- Bookmarklet opens in a draggable, resizable in-page panel with 4-corner resize handles
 - Glass UI with progress ring and animated orb during crawl
 
 ---
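
The feature list in this hunk says a crawl limit such as `/job` also matches `/en/job` and `/fr/job`. The sketch below shows one way such language-agnostic matching could work. It is not taken from the repository: the function names are hypothetical, and it assumes that a locale prefix is a first path segment like `en`, `fr`, or `pt-br`, and that limits match on whole path segments (so `/job` does not match `/jobs`).

```python
import re
from urllib.parse import urlsplit

# Assumption: a locale prefix is a first path segment such as "en", "fr", or "pt-br".
_LOCALE_SEGMENT = re.compile(r"^[a-z]{2}(?:-[a-z]{2})?$", re.IGNORECASE)


def strip_locale_prefix(path: str) -> str:
    """Drop a leading locale segment, so "/en/job/123" becomes "/job/123"."""
    segments = [s for s in path.split("/") if s]
    if segments and _LOCALE_SEGMENT.match(segments[0]):
        segments = segments[1:]
    return "/" + "/".join(segments)


def matches_limit(url: str, limit_path: str) -> bool:
    """Return True when the URL path falls under limit_path, ignoring any locale prefix."""
    path = strip_locale_prefix(urlsplit(url).path)
    limit = "/" + limit_path.strip("/")
    # Hypothetical rule: match the limit itself or anything nested below it.
    return path == limit or path.startswith(limit + "/")
```

With these assumptions, `matches_limit("https://example.com/fr/job", "/job")` is true, while a URL ending in `/jobs` is not counted against the `/job` limit. The real crawler may use a looser prefix match.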
@@ -45,7 +52,7 @@
 
 This project is deployed on a personal Cloud Run host:
 
-https://site-crawler-909296093050.europe-west2.run.app/
+https://site-crawler-989268314020.europe-west2.run.app/
 
 For production use, deploy to **your own** Cloud Run service and update `APP_ORIGIN` in `docs/bookmarklet.js`.
 
@@ -79,19 +86,19 @@ For production use, deploy to **your own** Cloud Run service and update `APP_ORI
 2. Add optional exclude paths such as `/jobs`, `/careers`, `/admin`.
 3. Define crawl limits by path if required.
 4. Configure max pages and concurrency.
-5. Choose options such as ignoring job pages or running a broken link check.
+5. Choose options such as ignoring job pages, running a broken link check, or enabling parameter audit.
 6. Run the crawl.
-7. Review results and export TXT or CSV if needed.
+7. Review the validation report, audit sections, and export TXT or CSV if needed.
 
 ### Quick start (step-by-step)
 1. Paste the site homepage in **Homepage URL** (e.g. `https://example.com`).
 2. Add **Exclude paths** (one per line). Only lines starting with `/` are used.
 3. Add **Crawl limits by path** to cap noisy sections (e.g. `/job` max 5).
 4. Set **Max pages** and **Concurrency** based on how deep you want to go.
-5. Toggle **Ignore job pages** or **Broken link quick check** if needed.
+5. Toggle **Ignore job pages**, **Broken link quick check**, or **Parameter audit** if needed.
 6. Click **Run crawl**, then download **TXT** or **CSV** from Results.
 
-Tip: if you enable **Broken link quick check**, you can filter status codes to spot 404s quickly.
+Tip: enable **Broken link quick check** to classify live HTTP errors, and enable **Parameter audit** when you need route-level querystring validation.
 
 ### Landing page
 See a marketing-style overview at `docs/landing.html` (matches the in-app color scheme).
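
The quick start in this hunk states that exclude paths are entered one per line and that only lines starting with `/` are used. A minimal sketch of that parsing rule follows. The function name is hypothetical, and trimming surrounding whitespace and dropping duplicates are assumptions rather than documented behaviour.

```python
def parse_exclude_paths(text: str) -> list[str]:
    """Turn the multi-line "Exclude paths" field into a list of usable paths."""
    paths: list[str] = []
    for line in text.splitlines():
        candidate = line.strip()  # assumption: surrounding whitespace is ignored
        # Documented rule: only lines starting with "/" are used.
        if candidate.startswith("/") and candidate not in paths:
            paths.append(candidate)
    return paths
```

For input containing `/jobs`, `careers`, and `/admin` on separate lines, this keeps `/jobs` and `/admin` and silently ignores `careers`, which is worth knowing when a path appears not to be excluded.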
@@ -144,7 +151,12 @@ Each discovered URL is filtered using:
 - The UI displays progress using a time-based progress indicator while crawling.
 
 ### Results
-- Returned URLs include original URL, final URL after redirects, and optional HTTP status.
+- Returned URLs include original URL, final URL after redirects, HTTP status, source type, and referrer page.
+- Audit entries are classified as `valid`, `broken`, `redirect_issue`, or `soft_failure`.
+- Redirect audit highlights loops, multi-hop redirects, dropped params, and irrelevant destinations.
+- Soft-failure audit flags successful pages that still fail functionally.
+- Impact audit prioritises broken and redirect issues by repetition and core-flow importance.
+- Pattern audit groups URLs by structure and highlights inconsistencies.
 - Duplicate candidates are grouped by base URL and flagged when query or language variants exist.
 - Results can be exported as TXT or CSV.
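
The Results list above names four classifications (`valid`, `broken`, `redirect_issue`, `soft_failure`) and several redirect checks. The sketch below illustrates how loops, multi-hop redirects, and dropped query parameters could be detected from a redirect chain, and how an entry might then be classified. It is an illustration only: the function names, the flag names, the chain format (a list of URLs from original to final), and the precedence of the rules are all assumptions. Detecting "irrelevant destinations" is left out because it needs a content-level judgement.

```python
from urllib.parse import parse_qsl, urlsplit


def redirect_flags(chain: list[str]) -> list[str]:
    """Flag issues in a redirect chain given as [original_url, hop_1, ..., final_url]."""
    flags: list[str] = []
    if len(set(chain)) < len(chain):
        flags.append("loop")  # the same URL appears more than once
    if len(chain) > 2:
        flags.append("multi_hop")  # more than one redirect step
    if len(chain) > 1:
        original = dict(parse_qsl(urlsplit(chain[0]).query, keep_blank_values=True))
        final = dict(parse_qsl(urlsplit(chain[-1]).query, keep_blank_values=True))
        if any(key not in final for key in original):
            flags.append("dropped_params")  # a query key was lost along the way
    return flags


def classify(status: int, chain: list[str], soft_failure: bool = False) -> str:
    """Map one audit entry to a classification (hypothetical precedence)."""
    if status >= 400:
        return "broken"
    if redirect_flags(chain):
        return "redirect_issue"
    if soft_failure:
        return "soft_failure"
    return "valid"
```

Here a single clean redirect is still treated as `valid`; a stricter audit could reasonably flag every redirect instead.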

@@ -165,11 +177,23 @@ Request body:
     "concurrency": 6,
     "includeQuery": true,
     "ignoreJobPages": true,
-    "brokenLinkCheck": false
+    "brokenLinkCheck": false,
+    "parameterAudit": true,
+    "patternMatchFilter": "/jobs"
   }
 }
 ```
 
+Key response sections:
+- `urls`: crawled page records
+- `audit`: validated navigation entries with referrer pages and classifications
+- `issueReport`: broken URLs, redirect issues, parameter issues, soft failures, and impact analysis
+- `impactAudit`: prioritised broken/redirect issues
+- `redirectAudit`: redirect-chain QA
+- `softFailureAudit`: successful-but-broken pages
+- `patternAudit`: structural URL grouping and inconsistency detection
+- `parameterAudit`: query-parameter handling checks
+
 ---
 
 ## Bookmarklet (Cat Crawler)
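
The hunk above documents the top-level keys of the crawl response but not their shapes. As a rough sketch of how a client might summarise a response, the helper below counts the entries in each documented section. The function name is hypothetical, and it assumes each section is a JSON array or object. A missing section, or one with any other type, is reported as 0.

```python
# Section names taken from the "Key response sections" list in the README diff.
RESPONSE_SECTIONS = [
    "urls",
    "audit",
    "issueReport",
    "impactAudit",
    "redirectAudit",
    "softFailureAudit",
    "patternAudit",
    "parameterAudit",
]


def summarize_response(payload: dict) -> dict[str, int]:
    """Count entries per documented section of an already-parsed JSON response."""
    summary: dict[str, int] = {}
    for key in RESPONSE_SECTIONS:
        section = payload.get(key)
        # Assumption: sections are lists or dicts; anything else counts as empty.
        summary[key] = len(section) if isinstance(section, (list, dict)) else 0
    return summary
```

A summary like this gives a quick sanity check that optional audits, for example `parameterAudit`, actually ran before the results are exported.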
@@ -179,8 +203,9 @@ Use the crawler on the page you are currently visiting.
 1. Open the GitHub Pages landing page in `docs/index.html`.
 2. Drag the **Cat Crawler** bookmarklet button to your bookmarks bar.
 3. Click the bookmark on any site to open **Cat Crawler**. It auto-fills the current page URL.
+4. Drag the panel by the top bar and resize it from any of the four corners.
 
-Tip: The landing page button **Cat Crawler 😼** is a live bookmarklet link, so you can drag it to your bookmarks bar to install the latest script.
+Tip: The landing page button **Drag Cat Crawler 😼** is now a loader bookmarklet. Reinstall it once, and future bookmarklet UI updates will come from `bookmarklet.js` without another reinstall.
 
 ---
 
docs/index.html

Lines changed: 9 additions & 7 deletions
@@ -193,7 +193,7 @@
 <span class="tag">Cat Crawler · Drag-to-install bookmarklet</span>
 <h1>Fast, focused crawling for content teams.</h1>
 <p class="sub">
-Crawl a site with smart exclusions, language-agnostic limits, and optional broken-link checks. Built for quick QA and clean URL lists.
+Crawl a site with smart exclusions, language-agnostic limits, redirect tracing, parameter validation, and soft-failure checks. Built for quick QA and clean URL lists.
 Drag the Cat Crawler button straight to your bookmarks bar, then click it on any site to open the crawler with the current page prefilled.
 </p>
 <div class="cta-row">
@@ -223,11 +223,11 @@ <h3>Why it matters</h3>
 </div>
 <div class="card">
 <h3>What’s new</h3>
-<p>Drag-and-drop bookmarklet install, instant launch on the current page, and cleaner exclusion matching.</p>
+<p>Draggable and resizable bookmarklet panel, richer audit reports, and cleaner exclusion matching.</p>
 </div>
 <div class="card">
 <h3>Built-in outputs</h3>
-<p>Export TXT/CSV with final URLs, redirects, and optional status codes.</p>
+<p>Broken links, redirect issues, parameter issues, soft failures, impact analysis, and TXT/CSV export.</p>
 </div>
 </section>
 
@@ -259,7 +259,7 @@ <h4>Set path limits</h4>
 <div class="step-num">4</div>
 <div>
 <h4>Choose options</h4>
-<p>Adjust max pages/concurrency; toggle “Ignore job pages” or “Broken link quick check.”</p>
+<p>Adjust max pages/concurrency; toggle “Ignore job pages,” “Broken link quick check,” or “Parameter audit.”</p>
 </div>
 </div>
 <div class="step">
@@ -271,7 +271,7 @@ <h4>Run and export</h4>
 </div>
 </div>
 <div class="note">
-Tip: enable “Broken link quick check” to capture HTTP status codes and spot 404s quickly.
+Tip: enable “Broken link quick check” for live HTTP status codes and “Parameter audit” when you need querystring validation.
 </div>
 </section>
 
@@ -283,6 +283,7 @@ <h3>Bookmarklet setup (Cat Crawler)</h3>
 <li>Drag the <code>Drag Cat Crawler 😼</code> button above to your bookmarks bar.</li>
 <li>Open any page you want to inspect.</li>
 <li>Click the saved Cat Crawler bookmark to open the crawler with the current URL prefilled.</li>
+<li>Drag the header to move it, then resize from any corner.</li>
 </ol>
 </div>
 <div class="card">
@@ -292,6 +293,7 @@ <h3>Quality checklist</h3>
 <li>Confirm sitemap available and robots.txt respected</li>
 <li>Exclude careers/job sections unless needed</li>
 <li>Cap pagination-heavy paths</li>
+<li>Check redirect issues, soft failures, and impact ranking before export</li>
 <li>Export CSV and share with stakeholders</li>
 </ul>
 </div>
@@ -300,11 +302,11 @@ <h3>Quality checklist</h3>
 <section class="section grid-2">
 <div class="card">
 <h3>Results you can trust</h3>
-<p>Each URL includes original + final destination, optional HTTP status, and flags for duplicates.</p>
+<p>Each entry includes source, referrer, final destination, status, and classification across broken, redirect, parameter, and soft-failure checks.</p>
 </div>
 <div class="card">
 <h3>Team-friendly presets</h3>
-<p>Save configurations per client, export to JSON, and re-use for future audits.</p>
+<p>Save configurations per client, export to JSON, and re-use full audit setups for future validations.</p>
 </div>
 </section>
 
docs/landing.html

Lines changed: 9 additions & 7 deletions
@@ -193,7 +193,7 @@
 <span class="tag">Cat Crawler · Drag-to-install bookmarklet</span>
 <h1>Fast, focused crawling for content teams.</h1>
 <p class="sub">
-Crawl a site with smart exclusions, language-agnostic limits, and optional broken-link checks. Built for quick QA and clean URL lists.
+Crawl a site with smart exclusions, language-agnostic limits, redirect tracing, parameter validation, and soft-failure checks. Built for quick QA and clean URL lists.
 Drag the Cat Crawler button straight to your bookmarks bar, then click it on any site to open the crawler with the current page prefilled.
 </p>
 <div class="cta-row">
@@ -223,11 +223,11 @@ <h3>Why it matters</h3>
 </div>
 <div class="card">
 <h3>What’s new</h3>
-<p>Drag-and-drop bookmarklet install, instant launch on the current page, and cleaner exclusion matching.</p>
+<p>Draggable and resizable bookmarklet panel, richer audit reports, and cleaner exclusion matching.</p>
 </div>
 <div class="card">
 <h3>Built-in outputs</h3>
-<p>Export TXT/CSV with final URLs, redirects, and optional status codes.</p>
+<p>Broken links, redirect issues, parameter issues, soft failures, impact analysis, and TXT/CSV export.</p>
 </div>
 </section>
 
@@ -259,7 +259,7 @@ <h4>Set path limits</h4>
 <div class="step-num">4</div>
 <div>
 <h4>Choose options</h4>
-<p>Adjust max pages/concurrency; toggle “Ignore job pages” or “Broken link quick check.”</p>
+<p>Adjust max pages/concurrency; toggle “Ignore job pages,” “Broken link quick check,” or “Parameter audit.”</p>
 </div>
 </div>
 <div class="step">
@@ -271,7 +271,7 @@ <h4>Run and export</h4>
 </div>
 </div>
 <div class="note">
-Tip: enable “Broken link quick check” to capture HTTP status codes and spot 404s quickly.
+Tip: enable “Broken link quick check” for live HTTP status codes and “Parameter audit” when you need querystring validation.
 </div>
 </section>
 
@@ -283,6 +283,7 @@ <h3>Bookmarklet setup (Cat Crawler)</h3>
 <li>Drag the <code>Drag Cat Crawler 😼</code> button above to your bookmarks bar.</li>
 <li>Open any page you want to inspect.</li>
 <li>Click the saved Cat Crawler bookmark to open the crawler with the current URL prefilled.</li>
+<li>Drag the header to move it, then resize from any corner.</li>
 </ol>
 </div>
 <div class="card">
@@ -292,6 +293,7 @@ <h3>Quality checklist</h3>
 <li>Confirm sitemap available and robots.txt respected</li>
 <li>Exclude careers/job sections unless needed</li>
 <li>Cap pagination-heavy paths</li>
+<li>Check redirect issues, soft failures, and impact ranking before export</li>
 <li>Export CSV and share with stakeholders</li>
 </ul>
 </div>
@@ -300,11 +302,11 @@ <h3>Quality checklist</h3>
 <section class="section grid-2">
 <div class="card">
 <h3>Results you can trust</h3>
-<p>Each URL includes original + final destination, optional HTTP status, and flags for duplicates.</p>
+<p>Each entry includes source, referrer, final destination, status, and classification across broken, redirect, parameter, and soft-failure checks.</p>
 </div>
 <div class="card">
 <h3>Team-friendly presets</h3>
-<p>Save configurations per client, export to JSON, and re-use for future audits.</p>
+<p>Save configurations per client, export to JSON, and re-use full audit setups for future validations.</p>
 </div>
 </section>
 