Commit ab364a7

Author: Goncalves, Carla
Add Cat Crawler project blog post
1 parent 47cb697 commit ab364a7

1 file changed

Lines changed: 172 additions & 0 deletions

docs/blog-cat-crawler.md

@@ -0,0 +1,172 @@
1+
# Cat Crawler: why I built it, what is interesting about it, and what I would improve next
2+
3+
Official links:
4+
5+
- Project page: [https://carlashub.github.io/site-crawler/](https://carlashub.github.io/site-crawler/)
6+
- Repository: [https://github.com/CarlasHub/site-crawler](https://github.com/CarlasHub/site-crawler)
7+
8+
<video src="./assets/media/site-crawler-demo.webm" controls muted playsinline preload="metadata" poster="./assets/screenshots/01-dashboard.png" width="100%"></video>
9+
10+
If the video does not render inline in your browser, open [the demo video directly](./assets/media/site-crawler-demo.webm).
11+
12+
Cat Crawler started from a pretty simple need.
13+
14+
I wanted a quick checker for real websites, especially the large, messy ones with hundreds of listing pages, redirects, URL variants, and sections you do not want to crawl fully. I also did not want to keep leaning on paid or third-party web tools for something that was specific to the kind of review work I was already doing.
15+
16+
So this project became a small crawler and audit tool that I could shape around the problems I kept seeing.
17+
18+
It is a React frontend with a Node backend. The frontend starts a crawl job, polls for progress, and then turns the results into grouped reports. The backend does the crawling, stays on one host, respects `robots.txt`, starts from sitemap discovery when possible, and applies rules like exclude paths, path limits, query handling, broken-link checks, redirect review, parameter audit, soft-failure review, and URL pattern grouping.
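
To make the "stays on one host" and "exclude paths" rules concrete, here is a minimal sketch of a scope check. It is not taken from the repository: the `ScopeConfig` shape and the names `inScope`, `underPrefix`, and `resolveLink` are assumptions for illustration, and `robots.txt` handling is left out.

```typescript
// Hypothetical config shape; the real project's option names may differ.
interface ScopeConfig {
  startUrl: string;
  excludePaths: string[]; // path prefixes to skip, e.g. "/search"
}

// True when `path` equals `prefix` or sits underneath it, so "/jobs" matches
// "/jobs" and "/jobs/123" but not "/jobs-archive".
function underPrefix(path: string, prefix: string): boolean {
  const base = prefix.endsWith("/") ? prefix.slice(0, -1) : prefix;
  return path === base || path.startsWith(base + "/");
}

// Resolve a discovered href against the start URL; null if it cannot be parsed.
function resolveLink(href: string, base: string): URL | null {
  try {
    return new URL(href, base);
  } catch {
    return null;
  }
}

// Decide whether a discovered link belongs in this crawl.
function inScope(href: string, config: ScopeConfig): boolean {
  const url = resolveLink(href, config.startUrl);
  if (url === null) return false;
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  // One host per run: anything on another host is out of scope.
  if (url.host !== new URL(config.startUrl).host) return false;
  const path = url.pathname;
  return !config.excludePaths.some((prefix) => underPrefix(path, prefix));
}
```

Matching on whole path segments rather than raw string prefixes is a deliberate choice here, so that excluding `/search` does not silently drop `/search-tips`.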
19+
20+
What I still find most interesting about it is that it is not just trying to answer one question.
21+
22+
It is not only “are there broken links?”
23+
24+
It is also:
25+
26+
- where are redirects getting messy
27+
- where are query parameters being dropped or handled badly
28+
- which pages return `200` but still look broken
29+
- where is a site producing duplicate-looking URL structures
30+
- which problems probably matter more because they repeat or touch core flows
31+
32+
That combination matters more on big sites than on small ones.
33+
34+
On a large site, especially one with a lot of listing pages, search pages, jobs pages, filters, country or language paths, and old legacy redirects, a normal click-through is not enough. You need some way to crawl fast, ignore noisy sections, cap sections that explode, and still get something readable back at the end.
35+
36+
That is a big part of why this project has exclude paths, path-based crawl caps, optional job-page suppression, presets, and a bookmarklet launcher. I was trying to make it useful on the kinds of sites where one bad section can flood the whole result set.
37+
38+
## What is interesting about the project
39+
40+
The part I like most is that it stays quite practical.
41+
42+
It does not pretend to be a giant all-purpose testing platform. It takes a site URL, runs a crawl, and gives you grouped results you can actually work through.
43+
44+
A few parts stand out to me:
45+
46+
### 1. It was built for large, noisy sites
47+
48+
This is probably the main thing.
49+
50+
The project is clearly shaped around sites with lots of repeated templates and large listing areas. The path limits, exclude rules, and “ignore job pages” option are not decorative. They are there because some sections of a site can drown the crawl if you do not control them.
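
A path cap can be sketched as a small counter per section. This is an illustrative version only: `PathCapTracker`, `tryVisit`, and the cap table format are hypothetical names, not the project's actual API.

```typescript
// Hypothetical cap table: at most N pages per path prefix, e.g. { "/jobs": 50 }.
type PathCaps = Record<string, number>;

class PathCapTracker {
  private caps: PathCaps;
  private seen = new Map<string, number>();

  constructor(caps: PathCaps) {
    this.caps = caps;
  }

  // Returns true and records the visit while the section is under its cap;
  // returns false once the matching section has used up its allowance.
  tryVisit(pathname: string): boolean {
    // Prefer the longest matching prefix so "/jobs/uk" can override "/jobs".
    const prefix = Object.keys(this.caps)
      .filter((p) => pathname === p || pathname.startsWith(p + "/"))
      .sort((a, b) => b.length - a.length)[0];
    if (prefix === undefined) return true; // uncapped section
    const cap = this.caps[prefix] ?? Infinity;
    const used = this.seen.get(prefix) ?? 0;
    if (used >= cap) return false;
    this.seen.set(prefix, used + 1);
    return true;
  }
}
```

With a cap of 2 on `/jobs`, the first two job pages are crawled and the rest are skipped, while unrelated paths such as `/about` are never counted.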
51+
52+
That makes the tool much more useful for real teams working on career sites, large content estates, or any site with lots of repeated listing structures.
53+
54+
### 2. It does more than link checking
55+
56+
Basic link checking is useful, but it is not enough.
57+
58+
Cat Crawler also looks at redirect chains, parameter handling, soft failures, duplicate URL patterns, legacy/current path pairs, and issue impact. That makes it more useful during launches, migration work, cleanup work, and regression checking.
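
One piece of the parameter audit, spotting query parameters that disappear across a redirect, can be sketched like this. The function name `droppedParams` is an assumption, and the real audit may compare values as well as names.

```typescript
// Returns the names of query parameters that were on the requested URL
// but are missing from the final URL after redirects.
function droppedParams(requested: string, finalUrl: string): string[] {
  const before = new URL(requested).searchParams;
  const after = new URL(finalUrl).searchParams;
  const names = new Set<string>();
  before.forEach((_value, name) => names.add(name));
  return Array.from(names).filter((name) => !after.has(name));
}
```

For a request to `https://example.com/jobs?page=2&utm_source=x` that ends up at `https://example.com/careers?page=2`, this reports `utm_source` as dropped.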
59+
60+
### 3. The bookmarklet is small but useful
61+
62+
The bookmarklet does not do the crawl itself. It opens the app in a floating panel and passes the current page URL into it.
63+
64+
I like that because it keeps the heavy work in the app and backend, but still gives the person using it a very quick way to start from the page they are already looking at.
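
The hand-off itself is small. As a sketch, assuming the app reads the starting page from a query parameter (the parameter name `url` and the function name `buildLauncherUrl` are made up for illustration):

```typescript
// Builds the address the launcher would open in its panel.
// The "url" query parameter name is an assumption, not the project's real one.
function buildLauncherUrl(appBase: string, pageUrl: string): string {
  const target = new URL(appBase);
  target.searchParams.set("url", pageUrl); // encoding is handled by URLSearchParams
  return target.toString();
}
```

In a bookmarklet, `pageUrl` would be the current page's `window.location.href`, and the result would be used as the `src` of the floating panel.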
65+
66+
### 4. It tries to keep output readable
67+
68+
A lot of crawl tools dump a pile of URLs and leave it there.
69+
70+
This one at least tries to group the work into clearer sections: audit report, validation report, redirect audit, parameter audit, soft failures, URL patterns, issue impact, and duplicate content candidates. That makes it easier to review with a team instead of handing someone a raw export and wishing them luck.
71+
72+
### 5. It is honest about limits
73+
74+
The README is actually clear on this, and I think that helps.
75+
76+
It only crawls one host per run. Soft-failure detection is heuristic. Pattern and impact analysis help with review, but they do not replace judgement. That is the right tone for this kind of tool.
77+
78+
## What I learned building it
79+
80+
I learned a few things pretty quickly.
81+
82+
### A quick checker stops being quick once a site gets big
83+
84+
The original instinct was speed. Just give me a quick way to spot problems.
85+
86+
That works up to a point. Then the site gets bigger, the listing sections get noisier, the redirects get stranger, and the result set becomes useless unless the tool has some idea of scope. That is where exclude paths, path caps, presets, and grouped reporting stopped being “nice to have” and became basic survival.
87+
88+
### A `200` page can still be broken
89+
90+
This seems obvious, but it matters.
91+
92+
A page can return success and still be bad because the content did not load, an API failed, or the page is showing error text inside a successful response. That is why the soft-failure work matters. It is not perfect, but it points at something simple status checks miss all the time.
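
A heuristic along these lines might look like the sketch below. The phrase list, the `minTextLength` threshold, and the names `checkSoftFailure` and `ERROR_PHRASES` are all illustrative assumptions; the project's own rules may be quite different.

```typescript
// Example phrases only; a real list would be tuned per site.
const ERROR_PHRASES = ["page not found", "something went wrong", "no results found"];

interface SoftFailureCheck {
  suspicious: boolean;
  reasons: string[];
}

function checkSoftFailure(status: number, html: string, minTextLength = 200): SoftFailureCheck {
  const reasons: string[] = [];
  // Only successful responses are candidates; hard failures are reported elsewhere.
  if (status !== 200) return { suspicious: false, reasons };

  // Crude text extraction: drop script/style blocks, then remaining tags.
  const text = html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim()
    .toLowerCase();

  if (text.length < minTextLength) reasons.push("very little visible text");
  for (const phrase of ERROR_PHRASES) {
    if (text.includes(phrase)) reasons.push(`contains "${phrase}"`);
  }
  return { suspicious: reasons.length > 0, reasons };
}
```

Returning the reasons alongside the flag keeps the result reviewable: a person can see why a page was flagged and dismiss false positives quickly.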
93+
94+
### Teams do not need more output, they need better sorting
95+
96+
Once there is enough data, the question changes.
97+
98+
It is no longer “did we find enough?”
99+
100+
It becomes “can anyone make sense of this without wasting half a day?”
101+
102+
That is where grouped reports, impact hints, presets, and section-specific views become more useful than just collecting more rows.
103+
104+
### Big sites need deliberate exclusions
105+
106+
You cannot treat every part of a site equally.
107+
108+
Some sections are worth crawling deeply. Some should be capped. Some should be skipped unless you are checking them on purpose. The tool became better once that was treated as part of the job instead of as an awkward extra.
109+
110+
### The tool needs to fit how people already work
111+
112+
The bookmarklet piece reminded me of this.
113+
114+
People are already on a page when they realise they want to check something. Starting from that real page matters. Saving presets matters too, because teams repeat similar checks over and over.
115+
116+
## Five improvements I would make next
117+
118+
There is already a roadmap in the docs, and I think it points in the right direction. If I were carrying this further, these are the five improvements I would care about most.
119+
120+
### 1. Crawl history and compare mode
121+
122+
This would be one of the most useful additions for real teams.
123+
124+
Being able to rerun a saved crawl and compare the results against an earlier run would make release checking much stronger. It would help answer the question people actually ask after a deploy: what changed, what got better, and what got worse?
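
Compare mode does not exist yet, so the following is only a sketch of what the core comparison could be. The `Issue` shape and the names `compareRuns` and `issueKey` are hypothetical.

```typescript
interface Issue {
  url: string;
  kind: string; // e.g. "broken-link", "redirect-chain"
}

interface RunDiff {
  added: Issue[]; // present now, absent before: what got worse
  resolved: Issue[]; // present before, absent now: what got better
  unchanged: Issue[]; // present in both runs
}

// Two issues are treated as the same when kind and URL both match.
const issueKey = (issue: Issue): string => `${issue.kind} ${issue.url}`;

function compareRuns(previous: Issue[], current: Issue[]): RunDiff {
  const prevKeys = new Set(previous.map(issueKey));
  const currKeys = new Set(current.map(issueKey));
  return {
    added: current.filter((i) => !prevKeys.has(issueKey(i))),
    resolved: previous.filter((i) => !currKeys.has(issueKey(i))),
    unchanged: current.filter((i) => prevKeys.has(issueKey(i))),
  };
}
```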
125+
126+
### 2. Better deduping and clearer issue ranking
127+
128+
Large sites repeat the same problem many times.
129+
130+
The tool already has impact analysis, but I would push this further: collapse repeated issues more aggressively, group them more clearly, and make it easier to tell which problems are noise and which deserve attention first.
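
One way to collapse repeats is to reduce each path to a pattern and count issues per pattern. This is a sketch under assumptions: the ID-detection rule (all digits, or eight or more hex characters) is a rough heuristic, and `toPattern`, `collapseIssues`, and both interfaces are illustrative names.

```typescript
interface PathIssue {
  path: string;
  kind: string;
}

interface CollapsedIssue {
  pattern: string;
  kind: string;
  count: number;
  example: string; // first path seen for this pattern
}

// Replace segments that look like IDs with a placeholder.
function toPattern(pathname: string): string {
  return pathname
    .split("/")
    .map((seg) => (/^\d+$/.test(seg) || /^[0-9a-f]{8,}$/i.test(seg) ? ":id" : seg))
    .join("/");
}

// Group issues by kind and pattern, most frequent first.
function collapseIssues(issues: PathIssue[]): CollapsedIssue[] {
  const groups = new Map<string, CollapsedIssue>();
  for (const issue of issues) {
    const pattern = toPattern(issue.path);
    const key = `${issue.kind} ${pattern}`;
    const existing = groups.get(key);
    if (existing) {
      existing.count += 1;
    } else {
      groups.set(key, { pattern, kind: issue.kind, count: 1, example: issue.path });
    }
  }
  return Array.from(groups.values()).sort((a, b) => b.count - a.count);
}
```

Two broken links at `/jobs/101` and `/jobs/102` then show up as one row for `/jobs/:id` with a count of 2, instead of two separate rows.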
131+
132+
### 3. Better section summaries for listing-heavy sites
133+
134+
This project is clearly aimed at large sites with repeated listing structures, so I would lean into that more.
135+
136+
I would want section-level summaries that tell a team, for example, how `/jobs`, `/blog`, or `/locations` behaved as a group instead of making them read page-by-page output first.
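
As a sketch of what such a summary could compute, assuming a section is simply the first path segment (the `PageResult` and `SectionSummary` shapes and the name `summariseSections` are hypothetical):

```typescript
interface PageResult {
  path: string;
  status: number;
}

interface SectionSummary {
  section: string; // first path segment, e.g. "/jobs"
  pages: number;
  errors: number; // responses with status 400 or above
}

function summariseSections(results: PageResult[]): SectionSummary[] {
  const bySection = new Map<string, SectionSummary>();
  for (const result of results) {
    const first = result.path.split("/").filter(Boolean)[0];
    const section = first ? `/${first}` : "/";
    let summary = bySection.get(section);
    if (!summary) {
      summary = { section, pages: 0, errors: 0 };
      bySection.set(section, summary);
    }
    summary.pages += 1;
    if (result.status >= 400) summary.errors += 1;
  }
  return Array.from(bySection.values());
}
```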
137+
138+
### 4. Shareable report views for teams
139+
140+
Right now exports help, but I think this could go further.
141+
142+
A cleaner shareable report view would make handoff easier for QA, developers, SEO people, content teams, and project managers. It would be better if people could open one view and see the grouped findings without having to pass raw files around.
143+
144+
### 5. Stronger page context for failures
145+
146+
This one matters because a URL on its own is often not enough.
147+
148+
I would want better page context around issues: stronger clues about the page type, the template pattern, the title, the source of the bad link, and why the tool thinks something matters. That would cut review time a lot on large sites.
149+
150+
## How this serves teams
151+
152+
I think the clearest value of Cat Crawler is that it helps different people look at the same site from slightly different angles without needing a different tool for each one.
153+
154+
For QA teams, it helps with launch checks, regression passes, redirect review, and broken-path review.
155+
156+
For developers, it helps spot route problems, dropped parameters, repeated bad patterns, and sections that need cleaner rules.
157+
158+
For SEO or content teams, it helps surface duplicate-looking paths, legacy/current path mismatches, redirect problems, and weak sections of site structure.
159+
160+
For project teams, the presets and grouped reports help turn repeat checks into something less manual.
161+
162+
That is probably the main thing I would say about the project now. It started as a quick checker, but it became more useful once it stopped trying to crawl everything blindly and started helping people work through large, messy sites in a more controlled way.
163+
164+
## What I still like about it
165+
166+
I still like that it is very direct.
167+
168+
Open the app. Start a crawl. Control the noisy sections. Review the grouped output. Export if needed. Use the bookmarklet if you are already on the page you want to start from.
169+
170+
That is a good shape for this kind of tool.
171+
172+
It is not trying to do everything. It is trying to be useful where big public sites usually get messy.
