Download Gitbook documentation of any site as MD files to use with LLMs

Amal-David/gitbook-downloader

Documentation Downloader for LLMs

A tool that converts documentation sites into markdown format, optimized for use with Large Language Models (LLMs) like ChatGPT, Claude, and LLaMA.

Purpose

  • Download technical documentation for training custom LLMs
  • Create knowledge bases for ChatGPT, Claude, and other AI assistants
  • Feed documentation into context windows of AI chatbots
  • Generate markdown files optimized for LLM processing

Supported Platforms

The downloader uses a plugin-based architecture with specialized extractors for different documentation platforms:

| Extractor | Platforms | Detection |
| --- | --- | --- |
| MintlifyExtractor | Mintlify docs (e.g., docs.metadao.fi) | id="navigation-items" |
| VocsExtractor | Vocs docs (e.g., metalex-docs.vercel.app, docs.zamm.eth.limo) | class="vocs_Sidebar_navigation" |
| DocusaurusExtractor | Docusaurus v2/v3 (e.g., docs.aztec.network, noir-lang.org) | class="menu__list" + class="menu__link" |
| ModernGitBookExtractor | Next.js GitBook (e.g., gmtribe.gitbook.io, docs.zama.org) | id="table-of-contents" or class="toclink" |
| GitBookExtractor | Traditional GitBook sites | nav/aside with ul/ol lists |
| FallbackExtractor | Any site | Extracts all same-domain links |

Extractors are tried in priority order, and the first one that matches handles the site.
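The first-match rule can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the table order is the priority order, and it replaces each extractor's can_handle() method with a crude substring check on the raw HTML (the real extractors work on parsed BeautifulSoup trees). The names EXTRACTORS and pick_extractor are hypothetical.

```python
from typing import Callable, List, Tuple

# (name, detector) pairs in assumed priority order. Each detector is a
# simplified substring check standing in for the real can_handle() method.
EXTRACTORS: List[Tuple[str, Callable[[str], bool]]] = [
    ("MintlifyExtractor", lambda html: 'id="navigation-items"' in html),
    ("VocsExtractor", lambda html: "vocs_Sidebar_navigation" in html),
    ("DocusaurusExtractor",
     lambda html: "menu__list" in html and "menu__link" in html),
    ("ModernGitBookExtractor",
     lambda html: 'id="table-of-contents"' in html or "toclink" in html),
    ("GitBookExtractor",
     lambda html: ("<nav" in html or "<aside" in html)
     and ("<ul" in html or "<ol" in html)),
    ("FallbackExtractor", lambda html: True),  # always matches
]


def pick_extractor(html: str) -> str:
    """Return the name of the first extractor whose detector matches."""
    for name, matches in EXTRACTORS:
        if matches(html):
            return name
    raise RuntimeError("unreachable: the fallback always matches")


print(pick_extractor('<div id="navigation-items"></div>'))
```

Because the fallback always matches, ordering matters: a more specific extractor has to appear earlier in the list than a more general one, or it will never be chosen.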

Features

  • Multi-platform support: Automatically detects and handles different documentation frameworks
  • Hierarchical navigation: Preserves document structure with proper depth/indentation
  • Smart content extraction: Removes navigation, sidebars, and boilerplate; keeps main content
  • Table of Contents generation: Creates navigable TOC from extracted pages
  • Duplicate detection: Content hashing prevents duplicate pages
  • Rate limiting: Built-in delays and retry logic with exponential backoff
  • Doc section filtering: Prevents crawling into unrelated documentation areas (e.g., stays in /developers/ without crawling /operators/)
  • Version path filtering: Avoids duplicating content from multiple doc versions (e.g., /nightly/, /next/)
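
Two of the features above, duplicate detection and exponential backoff, can be sketched with the standard library alone. The details here are assumptions made for illustration: the README does not say which hash function is used, whether content is normalized before hashing, or what the delay parameters are. The names is_duplicate and backoff_delays are hypothetical.

```python
import hashlib
from typing import List, Set


def is_duplicate(content: str, seen_hashes: Set[str]) -> bool:
    """Record the content's hash and report whether it was seen before."""
    # Collapse whitespace so trivial formatting differences hash the same
    # (an assumed normalization step).
    normalized = " ".join(content.split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False


def backoff_delays(retries: int, base: float = 1.0, factor: float = 2.0) -> List[float]:
    """Seconds to wait before each retry: base, base*factor, base*factor**2, ..."""
    return [base * factor ** attempt for attempt in range(retries)]


seen: Set[str] = set()
print(is_duplicate("Hello  world", seen))    # first time this content is seen
print(is_duplicate("Hello world\n", seen))   # same text after normalization
print(backoff_delays(3))
```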

Installation

  1. Clone this repository
  2. Install dependencies:
poetry install

Usage

Using CLI Tool

Download documentation to a markdown file:

poetry run python cli.py download <url> --output <output_file.md>

Example:

poetry run python cli.py download https://docs.example.com/ -o docs.md

Downloading a specific section

Use the --section-only / -s flag to download only pages within a specific documentation section:

poetry run python cli.py download "https://docs.uniswap.org/contracts/liquidity-launchpad/Overview" --section-only -o liquidity-launchpad.md

This restricts crawling to URLs that share the starting URL's path prefix (e.g., /contracts/liquidity-launchpad/), which is useful for downloading just one section of a large documentation site.
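
The prefix check can be sketched like this. It is a minimal illustration under the assumption that the prefix is the directory part of the starting URL's path and that only same-host URLs qualify; the helper names section_prefix and in_section are hypothetical and not taken from the tool.

```python
from urllib.parse import urlparse


def section_prefix(start_url: str) -> str:
    """Directory part of the start URL's path, always ending in '/'."""
    path = urlparse(start_url).path
    if path.endswith("/"):
        return path
    # Drop the last path segment (the page itself) and keep its directory.
    return path.rsplit("/", 1)[0] + "/"


def in_section(url: str, start_url: str) -> bool:
    """True if url is on the same host and under the start URL's prefix."""
    start, candidate = urlparse(start_url), urlparse(url)
    return (candidate.netloc == start.netloc
            and candidate.path.startswith(section_prefix(start_url)))


start = "https://docs.uniswap.org/contracts/liquidity-launchpad/Overview"
print(section_prefix(start))
```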

Using Web Interface

  1. Start the web server:
poetry run python app.py
  2. Open your browser and navigate to http://localhost:8080
  3. Enter the URL of a documentation site
  4. Choose to either:
    • View the converted content in your browser
    • Download the content as a markdown file
  5. Use the downloaded markdown with:
    • ChatGPT (paste into conversation)
    • Claude (upload as a file)
    • Custom LLaMA models (include in training data)
    • Any other LLM that accepts markdown input

Testing

Run the test script to verify the downloader works with multiple sites:

poetry run python test.py

This creates a tests-N folder with downloaded documentation from several test sites.

Test Sites

| Site | Extractor | URL |
| --- | --- | --- |
| ZAMM | VocsExtractor | docs.zamm.eth.limo |
| GMTribe | ModernGitBookExtractor | gmtribe.gitbook.io |
| MetaDAO | MintlifyExtractor | docs.metadao.fi |
| MetaLeX | VocsExtractor | metalex-docs.vercel.app |
| Aztec | DocusaurusExtractor | docs.aztec.network |
| Noir | DocusaurusExtractor | noir-lang.org/docs |
| Zama Protocol | ModernGitBookExtractor | docs.zama.org/protocol |
| Zama Solidity | ModernGitBookExtractor | docs.zama.org/protocol/solidity-guides |

Development & Debugging

For coding agents and contributors working on extractor improvements:

| Resource | Purpose |
| --- | --- |
| test.py | Automated test runner that downloads 8 reference documentation sites |
| test-prompt.md | Structured prompt for coding agents with verification checklist and debugging workflow |
| test-screenshots/ | Reference screenshots of expected sidebar/TOC structure for each test site |

Workflow for Fixing Extractors

  1. Run poetry run python test.py to generate test output
  2. Compare generated tests-N/*.md files against screenshots in test-screenshots/
  3. Use test-prompt.md as a guide for systematic verification
  4. Fix issues in gitbook_downloader.py and re-run tests

Known Limitations

Collapsed Navigation (Docusaurus, Vocs)

On sites with JavaScript-rendered collapsed navigation, the static HTML may not contain every nav item. The extractor recursively crawls pages to discover content, but:

  • Nav link titles may differ from page H1 headings
  • Items like "Quick Start" under collapsed "Getting Started" sections may be missing from TOC
  • Example: the Noir docs TOC is missing "Quick Start" because that page sits in a collapsed Docusaurus category

Client-Rendered Navigation (Modern GitBook)

Modern GitBook sites render navigation client-side. When the static sidebar has fewer than 10 items, the extractor supplements with content links and uses URL-based sorting to group related pages.

  • Example: Zama Protocol's FHE sub-items (library, host contracts, etc.) are correctly grouped under "FHE on blockchain" using URL-based sorting
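
One way URL-based sorting can group related pages is sketched below. This is an assumed approach, not the extractor's actual algorithm: pages are ordered by their path segments so children follow their parent, and depth is derived from the number of segments. The function name sort_by_url and the example URLs are hypothetical.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def _segments(url: str) -> List[str]:
    """Non-empty path segments of a URL."""
    return [s for s in urlparse(url).path.split("/") if s]


def sort_by_url(pages: List[Tuple[str, str]]) -> List[Tuple[str, str, int]]:
    """Sort (url, title) pairs by path; return (url, title, depth) tuples."""
    if not pages:
        return []
    # Comparing segment lists places a parent directly before its children.
    ordered = sorted(pages, key=lambda page: _segments(page[0]))
    # Depth is relative to the shallowest page in the set.
    base = min(len(_segments(url)) for url, _ in pages)
    return [(url, title, len(_segments(url)) - base) for url, title in ordered]


pages = [
    ("https://docs.example.com/protocol/fhe/library", "Library"),
    ("https://docs.example.com/protocol/roadmap", "Roadmap"),
    ("https://docs.example.com/protocol/fhe", "FHE on blockchain"),
    ("https://docs.example.com/protocol/fhe/host-contracts", "Host contracts"),
]
for url, title, depth in sort_by_url(pages):
    print("  " * depth + title)
```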

Multiple Sidebar Sections (Docusaurus)

Docusaurus sites with multiple "docs plugins" (e.g., developer docs + operator docs) may include items from all sidebars in the initial extraction.

  • Example: Aztec docs include both developer docs and node operator sections (Setup, Operation) from the same sidebar
  • Workaround: Start from a more specific URL like /developers/ instead of the root

Vocs Expandable vs Non-Expandable Items

Vocs sites may have inconsistent depth for expandable items (with children) vs non-expandable items (simple links) due to different HTML structures.

  • Example: MetaLeX "BORGs OS" (expandable) may appear at different depth than "Borg Auth" (non-expandable)

Version Path Filtering

The extractor filters paths like /nightly/, /next/, /canary/ to avoid duplicating content from multiple doc versions. Some version-specific content may be skipped.
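
A filter of this kind can be sketched as a check on whole path segments. This is an illustration only: the set below contains just the three names mentioned above, and the real list and matching rule may differ. The names VERSION_SEGMENTS and is_version_path are hypothetical.

```python
from urllib.parse import urlparse

# Only the version names mentioned in this README; the tool's real list
# may be longer.
VERSION_SEGMENTS = {"nightly", "next", "canary"}


def is_version_path(url: str) -> bool:
    """True if any whole path segment is a known version name."""
    segments = urlparse(url).path.lower().split("/")
    return any(segment in VERSION_SEGMENTS for segment in segments)


print(is_version_path("https://docs.example.com/docs/next/api"))
print(is_version_path("https://docs.example.com/docs/next-steps"))
```

Matching whole segments rather than substrings keeps a page such as /docs/next-steps from being skipped by accident.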

Doc Section Filtering

The extractor filters URLs that go into different documentation sections (e.g., /solidity-guides/ when starting from /protocol/). Recognized section prefixes include: developers, operators, nodes, guides, tutorials, api, reference, solidity-guides, relayer-sdk-guides, examples.
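
One rule consistent with the examples above is: skip a URL if its path contains a recognized section name that the starting URL's path does not. The sketch below implements that guess; the actual rule in the tool may be different, and the names SECTION_PREFIXES, path_sections, and crosses_section are hypothetical.

```python
from typing import Set
from urllib.parse import urlparse

# Section names listed in this README.
SECTION_PREFIXES = {
    "developers", "operators", "nodes", "guides", "tutorials", "api",
    "reference", "solidity-guides", "relayer-sdk-guides", "examples",
}


def path_sections(url: str) -> Set[str]:
    """Recognized section names that appear as whole path segments."""
    segments = urlparse(url).path.lower().split("/")
    return {segment for segment in segments if segment in SECTION_PREFIXES}


def crosses_section(url: str, start_url: str) -> bool:
    """True if url enters a recognized section the start URL is not in."""
    return bool(path_sections(url) - path_sections(start_url))


print(crosses_section("https://docs.example.com/operators/setup",
                      "https://docs.example.com/developers/intro"))
```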

Adding New Extractors

To support a new documentation platform, create a class that extends NavExtractor:

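from typing import List, Set

from bs4 import BeautifulSoup

# Assumption: NavExtractor is importable from gitbook_downloader.py,
# the module named in the workflow above.
from gitbook_downloader import NavExtractor

class MyExtractor(NavExtractor):
    def can_handle(self, soup: BeautifulSoup) -> bool:
        # Return True if this extractor can handle the page
        return soup.find(class_="my-nav-class") is not None

    def extract(self, soup: BeautifulSoup, base_url: str, processed_urls: Set[str]) -> List[tuple]:
        # Return list of (url, title, depth) tuples
        # url can be None for section headers
        nav_links = []
        # ... extraction logic ...
        return nav_links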

Then add it to the extractors list in GitbookDownloader.__init__().
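
To show what the (url, title, depth) tuples carry, the sketch below renders them as an indented markdown list. It is illustrative only: the tool's real TOC format is not documented here, and the name render_toc and the example URL are hypothetical.

```python
from typing import List, Optional, Tuple

# (url, title, depth); url is None for a section header without a page.
NavLink = Tuple[Optional[str], str, int]


def render_toc(nav_links: List[NavLink]) -> str:
    """Render nav tuples as a nested markdown bullet list."""
    lines = []
    for url, title, depth in nav_links:
        indent = "  " * depth
        if url is None:
            lines.append(f"{indent}- {title}")          # plain section header
        else:
            lines.append(f"{indent}- [{title}]({url})")  # linked page
    return "\n".join(lines)


print(render_toc([
    (None, "Getting Started", 0),
    ("https://docs.example.com/quick-start", "Quick Start", 1),
]))
```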

Technical Details

The application uses:

  • aiohttp for async HTTP requests
  • BeautifulSoup4 for HTML parsing
  • markdownify for HTML to markdown conversion
  • Flask for the web interface
  • python-slugify for URL/filename handling

Architecture

GitbookDownloader
├── NavExtractor (ABC)
│   ├── MintlifyExtractor      - Mintlify documentation sites
│   ├── VocsExtractor          - Vocs documentation sites
│   ├── DocusaurusExtractor    - Docusaurus v2/v3 sites
│   ├── ModernGitBookExtractor - Next.js GitBook sites
│   ├── GitBookExtractor       - Traditional GitBook sites
│   └── FallbackExtractor      - Generic fallback for any site
├── _extract_nav_links()   - Runs extractors in priority order
├── _follow_nav_links()    - Recursively processes navigation
├── _process_page_content() - Extracts and cleans page content
└── _generate_markdown()   - Produces final markdown output
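
The recursive step in the tree above can be sketched without any network access. This is a simplified, synchronous stand-in for _follow_nav_links(), which the real tool presumably implements with aiohttp: the get_links callable is a hypothetical substitute for fetching a page and extracting its nav links, and the set of processed URLs keeps cycles from looping forever.

```python
from typing import Callable, List, Set


def follow_nav_links(start: str, get_links: Callable[[str], List[str]]) -> List[str]:
    """Visit pages depth-first from start, skipping URLs already processed."""
    processed: Set[str] = set()
    order: List[str] = []

    def visit(url: str) -> None:
        if url in processed:
            return  # already handled, or a link back to an ancestor
        processed.add(url)
        order.append(url)
        for link in get_links(url):
            visit(link)

    visit(start)
    return order


# A tiny in-memory "site" with a cycle back to the root.
site = {"/": ["/a", "/b"], "/a": ["/b", "/"], "/b": []}
print(follow_nav_links("/", lambda url: site.get(url, [])))
```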
