Improve docs_crawler with sitemap seeding and async parallel fetching #279

@Ash-934

Description

What feature do you want to see added?

Problem

The current docs_crawler.py uses synchronous, sequential HTTP requests (stack-based DFS) to crawl Jenkins documentation pages. This makes the crawl slow, as each page is fetched one at a time. Additionally, the crawler relies entirely on link discovery within pages, which may miss pages that aren't linked from other doc pages but are listed in the sitemap.
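
For reference, the sequential pattern described above looks roughly like this. This is a simplified sketch of the pattern, not the actual docs_crawler.py code, and it uses stdlib `urllib` in place of `requests` to stay self-contained; the helper names are illustrative:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen  # stand-in for requests.get in this sketch


class _LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_doc_links(html, base_url):
    """Return absolute /doc/ URLs found in a page, with fragments stripped."""
    parser = _LinkParser()
    parser.feed(html)
    out = []
    for href in parser.links:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).path.startswith("/doc/"):
            out.append(absolute.split("#")[0])
    return out


def crawl(start_url):
    """Stack-based DFS: one blocking fetch per page, no parallelism."""
    stack, seen, pages = [start_url], set(), {}
    while stack:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        # Each fetch blocks until the previous one finishes -- this is
        # the bottleneck the proposal below addresses.
        pages[url] = urlopen(url).read().decode("utf-8", errors="replace")
        stack.extend(extract_doc_links(pages[url], url))
    return pages
```

Because the whole crawl is a single loop of blocking fetches, total time scales linearly with page count, and any page never linked from another doc page is simply never visited.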

Proposed Solution

Replace the synchronous crawler with a hybrid approach that:

  1. Seeds from sitemap.xml — Fetches https://www.jenkins.io/sitemap.xml and extracts all /doc/ URLs upfront, ensuring comprehensive coverage of published pages.
  2. Follows in-page links — Still discovers additional URLs by parsing links on each fetched page, catching anything the sitemap might miss.
  3. Uses async parallel fetching — Replaces synchronous requests calls with aiohttp and an async worker pool, significantly improving crawl speed.
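
The three steps above could be sketched as follows. This is a minimal illustration, not the actual implementation: the function names are invented, the sitemap parsing assumes the standard sitemap XML namespace, and `aiohttp` is imported lazily so the rest of the sketch runs without it:

```python
import asyncio
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def doc_urls_from_sitemap(sitemap_xml):
    """Step 1: seed the crawl from every /doc/ URL listed in sitemap.xml."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)]
    return [u for u in urls if "/doc/" in u]


async def worker(queue, seen, results, fetch, extract_links):
    """Step 3: one of N async workers pulling URLs off a shared queue."""
    while True:
        url = await queue.get()
        try:
            html = await fetch(url)
        except Exception:
            pass  # a real crawler would log and/or retry failed fetches
        else:
            results[url] = html
            # Step 2: still follow in-page links the sitemap may have missed.
            for link in extract_links(html, url):
                if link not in seen:
                    seen.add(link)
                    queue.put_nowait(link)
        finally:
            queue.task_done()


async def crawl(seed_urls, fetch, extract_links, concurrency=10):
    queue = asyncio.Queue()
    seen, results = set(seed_urls), {}
    for url in seed_urls:
        queue.put_nowait(url)
    workers = [
        asyncio.create_task(worker(queue, seen, results, fetch, extract_links))
        for _ in range(concurrency)
    ]
    await queue.join()  # wait until every queued URL has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


async def aiohttp_fetch(url):
    """Replaces synchronous requests calls; a real crawler would share
    one ClientSession across all fetches rather than open one per URL."""
    import aiohttp  # imported lazily so the sketch runs without aiohttp

    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()
```

Because `fetch` is passed in as a coroutine, the worker-pool logic can be exercised with a stub fetcher in tests, independently of the network.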

Future Improvements

  • Incremental crawl using <lastmod> — The sitemap XML includes a <lastmod> timestamp for each URL. We can store the last crawl timestamp and, on subsequent runs, only re-fetch pages whose <lastmod> is newer, making routine updates much faster than re-crawling everything from scratch.
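
The `<lastmod>` comparison could look something like this (a sketch assuming the previous crawl time is persisted somewhere, e.g. a JSON state file; the function name is illustrative, and for simplicity it assumes `<lastmod>` values in ISO 8601 form parseable by `datetime.fromisoformat`):

```python
import xml.etree.ElementTree as ET
from datetime import datetime

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_modified_since(sitemap_xml, last_crawl):
    """Return /doc/ URLs whose <lastmod> is newer than the previous crawl.

    URLs without a <lastmod> are included defensively, since we cannot
    prove they are unchanged.
    """
    stale = []
    root = ET.fromstring(sitemap_xml)
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if "/doc/" not in loc:
            continue
        lastmod = url_el.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if lastmod is None or datetime.fromisoformat(lastmod.strip()) > last_crawl:
            stale.append(loc)
    return stale
```

On a routine run, only the returned URLs would be queued for re-fetching; everything else is served from the previous crawl's output.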

Upstream changes

No response

Are you interested in contributing this feature?

Yes, I have already tested it locally
