Improve docs_crawler with sitemap seeding and async parallel fetching #279

@Ash-934

Description

What feature do you want to see added?

Problem

The current docs_crawler.py uses synchronous, sequential HTTP requests (stack-based DFS) to crawl Jenkins documentation pages. This makes the crawl slow, as each page is fetched one at a time. Additionally, the crawler relies entirely on link discovery within pages, which may miss pages that aren't linked from other doc pages but are listed in the sitemap.
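
For reference, the sequential pattern described above looks roughly like this. This is a simplified sketch of the pattern, not the actual docs_crawler.py code, and it uses stdlib `urllib` in place of `requests` to stay self-contained; the helper names are illustrative:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen  # stand-in for requests.get in this sketch


class _LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_doc_links(html, base_url):
    """Return absolute /doc/ URLs found in a page, with fragments stripped."""
    parser = _LinkParser()
    parser.feed(html)
    out = []
    for href in parser.links:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).path.startswith("/doc/"):
            out.append(absolute.split("#")[0])
    return out


def crawl(start_url):
    """Stack-based DFS: one blocking fetch per page, no parallelism."""
    stack, seen, pages = [start_url], set(), {}
    while stack:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        # Each fetch blocks until the previous one finishes -- this is
        # the bottleneck the proposal below addresses.
        pages[url] = urlopen(url).read().decode("utf-8", errors="replace")
        stack.extend(extract_doc_links(pages[url], url))
    return pages
```

Because the whole crawl is a single loop of blocking fetches, total time scales linearly with page count, and any page never linked from another doc page is simply never visited.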

Proposed Solution

Replace the synchronous crawler with a hybrid approach that:

  1. Seeds from sitemap.xml — Fetches https://www.jenkins.io/sitemap.xml and extracts all /doc/ URLs upfront, ensuring comprehensive coverage of published pages.
  2. Follows in-page links — Still discovers additional URLs by parsing links on each fetched page, catching anything the sitemap might miss.
  3. Uses async parallel fetching — Replaces synchronous requests calls with aiohttp and an async worker pool, significantly improving crawl speed.
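
The three steps above could be sketched as follows. This is a minimal illustration, not the actual implementation: the function names are invented, the sitemap parsing assumes the standard sitemap XML namespace, and `aiohttp` is imported lazily so the rest of the sketch runs without it:

```python
import asyncio
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def doc_urls_from_sitemap(sitemap_xml):
    """Step 1: seed the crawl from every /doc/ URL listed in sitemap.xml."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)]
    return [u for u in urls if "/doc/" in u]


async def worker(queue, seen, results, fetch, extract_links):
    """Step 3: one of N async workers pulling URLs off a shared queue."""
    while True:
        url = await queue.get()
        try:
            html = await fetch(url)
        except Exception:
            pass  # a real crawler would log and/or retry failed fetches
        else:
            results[url] = html
            # Step 2: still follow in-page links the sitemap may have missed.
            for link in extract_links(html, url):
                if link not in seen:
                    seen.add(link)
                    queue.put_nowait(link)
        finally:
            queue.task_done()


async def crawl(seed_urls, fetch, extract_links, concurrency=10):
    queue = asyncio.Queue()
    seen, results = set(seed_urls), {}
    for url in seed_urls:
        queue.put_nowait(url)
    workers = [
        asyncio.create_task(worker(queue, seen, results, fetch, extract_links))
        for _ in range(concurrency)
    ]
    await queue.join()  # wait until every queued URL has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


async def aiohttp_fetch(url):
    """Replaces synchronous requests calls; a real crawler would share
    one ClientSession across all fetches rather than open one per URL."""
    import aiohttp  # imported lazily so the sketch runs without aiohttp

    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()
```

Because `fetch` is passed in as a coroutine, the worker-pool logic can be exercised with a stub fetcher in tests, independently of the network.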

Future Improvements

  • Incremental crawl using <lastmod> — The sitemap XML includes a <lastmod> timestamp for each URL. We can store the last crawl timestamp and, on subsequent runs, only re-fetch pages whose <lastmod> is newer, making routine updates much faster than re-crawling everything from scratch.
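
The `<lastmod>` comparison could look something like this (a sketch assuming the previous crawl time is persisted somewhere, e.g. a JSON state file; the function name is illustrative, and for simplicity it assumes `<lastmod>` values in ISO 8601 form parseable by `datetime.fromisoformat`):

```python
import xml.etree.ElementTree as ET
from datetime import datetime

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_modified_since(sitemap_xml, last_crawl):
    """Return /doc/ URLs whose <lastmod> is newer than the previous crawl.

    URLs without a <lastmod> are included defensively, since we cannot
    prove they are unchanged.
    """
    stale = []
    root = ET.fromstring(sitemap_xml)
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if "/doc/" not in loc:
            continue
        lastmod = url_el.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if lastmod is None or datetime.fromisoformat(lastmod.strip()) > last_crawl:
            stale.append(loc)
    return stale
```

On a routine run, only the returned URLs would be queued for re-fetching; everything else is served from the previous crawl's output.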

Upstream changes

No response

Are you interested in contributing this feature?

Yes, I have already tested it locally
