What feature do you want to see added?
Problem
The current `docs_crawler.py` uses synchronous, sequential HTTP requests (stack-based DFS) to crawl Jenkins documentation pages. This makes the crawl slow, as each page is fetched one at a time. Additionally, the crawler relies entirely on link discovery within pages, which may miss pages that aren't linked from other doc pages but are listed in the sitemap.
Proposed Solution
Replace the synchronous crawler with a hybrid approach that:
- Seeds from `sitemap.xml` — Fetches https://www.jenkins.io/sitemap.xml and extracts all `/doc/` URLs upfront, ensuring comprehensive coverage of published pages.
- Follows in-page links — Still discovers additional URLs by parsing links on each fetched page, catching anything the sitemap might miss.
- Uses async parallel fetching — Replaces synchronous `requests` calls with `aiohttp` and an async worker pool, significantly improving crawl speed.
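The hybrid approach above could be sketched roughly as follows. This is not the actual patch, just an illustration of the shape: the worker count, the link-extraction regex, and the error handling are all assumptions.

```python
# Sketch of the proposed hybrid crawl: seed the queue from sitemap.xml, then
# fetch pages concurrently with aiohttp workers that also follow in-page
# /doc/ links. NUM_WORKERS and the link regex are illustrative choices.
import asyncio
import re
import xml.etree.ElementTree as ET

import aiohttp

SITEMAP_URL = "https://www.jenkins.io/sitemap.xml"
DOC_PREFIX = "https://www.jenkins.io/doc/"
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
NUM_WORKERS = 8  # hypothetical concurrency level


def extract_doc_urls(sitemap_xml: str) -> set[str]:
    """Return every /doc/ URL listed in the sitemap XML."""
    root = ET.fromstring(sitemap_xml)
    return {
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip().startswith(DOC_PREFIX)
    }


async def worker(session: aiohttp.ClientSession,
                 queue: "asyncio.Queue[str]", seen: set[str]) -> None:
    """Fetch URLs from the queue and enqueue newly discovered /doc/ links."""
    while True:
        url = await queue.get()
        try:
            async with session.get(url) as resp:
                html = await resp.text()
            for href in re.findall(r'href="([^"#?]+)"', html):
                if href.startswith("/doc/"):  # resolve site-relative links
                    href = "https://www.jenkins.io" + href
                if href.startswith(DOC_PREFIX) and href not in seen:
                    seen.add(href)
                    queue.put_nowait(href)
        except aiohttp.ClientError:
            pass  # a real implementation would log and possibly retry
        finally:
            queue.task_done()


async def crawl() -> set[str]:
    async with aiohttp.ClientSession() as session:
        # Step 1: seed from the sitemap for comprehensive coverage.
        async with session.get(SITEMAP_URL) as resp:
            seen = extract_doc_urls(await resp.text())
        queue: "asyncio.Queue[str]" = asyncio.Queue()
        for url in seen:
            queue.put_nowait(url)
        # Steps 2-3: workers fetch in parallel and follow in-page links.
        workers = [asyncio.create_task(worker(session, queue, seen))
                   for _ in range(NUM_WORKERS)]
        await queue.join()  # wait until every queued URL has been processed
        for w in workers:
            w.cancel()
    return seen
```

Because the `seen` set is seeded from the sitemap before any page is fetched, link discovery only adds URLs the sitemap missed, so no page is fetched twice.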
Future Improvements
- Incremental crawl using `<lastmod>` — The sitemap XML includes a `<lastmod>` timestamp for each URL. We can store the last crawl timestamp and on subsequent runs only re-fetch pages whose `<lastmod>` is newer, making routine updates much faster instead of re-crawling everything from scratch.
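The incremental mode could look roughly like this. The state-file name and JSON layout are hypothetical choices, not part of the current crawler:

```python
# Sketch of an incremental crawl keyed on <lastmod>: parse the timestamp for
# each /doc/ URL and re-fetch only pages modified since the previous crawl.
import json
import xml.etree.ElementTree as ET
from datetime import datetime
from pathlib import Path

STATE_FILE = Path("crawl_state.json")  # hypothetical location for crawl state
DOC_PREFIX = "https://www.jenkins.io/doc/"
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(sitemap_xml: str) -> dict[str, datetime]:
    """Map each /doc/ URL in the sitemap to its <lastmod> timestamp."""
    root = ET.fromstring(sitemap_xml)
    entries: dict[str, datetime] = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc.startswith(DOC_PREFIX) and lastmod:
            # Accept both date-only and full ISO-8601 values ("Z" suffix included).
            entries[loc] = datetime.fromisoformat(
                lastmod.strip().replace("Z", "+00:00"))
    return entries


def urls_to_refetch(entries: dict[str, datetime],
                    last_crawl: datetime) -> set[str]:
    """Return the URLs whose <lastmod> is newer than the previous crawl."""
    return {url for url, modified in entries.items() if modified > last_crawl}


def load_last_crawl() -> "datetime | None":
    """Read the previous crawl timestamp from the state file, if present."""
    if STATE_FILE.exists():
        return datetime.fromisoformat(
            json.loads(STATE_FILE.read_text())["last_crawl"])
    return None


def save_last_crawl(when: datetime) -> None:
    """Persist the crawl timestamp for the next incremental run."""
    STATE_FILE.write_text(json.dumps({"last_crawl": when.isoformat()}))
```

On a first run (no state file) the diff set would simply be every sitemap URL, so the incremental path degrades gracefully to a full crawl.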
Upstream changes
No response
Are you interested in contributing this feature?
Yes, I have already tested it locally