Markdown Lab

A powerful and modular web scraper that converts web content into well-structured Markdown, JSON, or XML files with RAG-ready chunking capabilities.
Markdown Lab combines Python and Rust components to scrape websites and convert HTML content to Markdown, JSON, or XML formats. It supports sitemap parsing, semantic chunking for RAG (Retrieval-Augmented Generation), and includes performance optimizations through Rust integration.

Key features include HTML-to-markdown/JSON/XML conversion with support for various elements (headers, links, images, lists, code blocks), content chunking that preserves document structure, and systematic content discovery through sitemap parsing. The hybrid architecture uses Python for high-level operations and Rust for performance-critical tasks.
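To make the split concrete, here is a minimal sketch of the Rust-first, Python-fallback pattern. This is illustrative only: the real wrapper module described under Development is more complete, and the convert helper below is hypothetical.

# Illustrative sketch only; the real wrapper is markdown_lab/markdown_lab_rs.py.
from bs4 import BeautifulSoup  # beautifulsoup4 is the documented fallback parser

try:
    from markdown_lab import markdown_lab_rs as _rs  # compiled PyO3 extension
    _RUST_AVAILABLE = True
except ImportError:
    _RUST_AVAILABLE = False

def convert(html: str, base_url: str) -> str:
    """Hypothetical helper showing the dispatch, not the library's API."""
    if _RUST_AVAILABLE:
        # Fast path: Rust handles parsing and conversion
        return _rs.convert_html_to_markdown(html, base_url)
    # Slow path: a deliberately minimal pure-Python stand-in
    return BeautifulSoup(html, "html.parser").get_text("\n", strip=True)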

Check out the project's DeepWiki page for a detailed breakdown of the repository.


Features

  • CLI Interface: Modern terminal interface built with Typer and Rich, with progress bars and interactive features
  • Terminal User Interface (TUI): Interactive interface for complex operations
  • Web Scraping: Scrapes accessible websites with error handling and rate limiting
  • Sitemap Integration: Parses sitemap.xml to discover and scrape relevant content
  • Multiple Output Formats: Converts HTML to Markdown, JSON, or XML formats
  • RAG Chunking: Implements chunking for Retrieval-Augmented Generation systems
  • Parallel Processing: Batch URL processing with configurable worker threads
  • Supported HTML Elements:
    • Headers (h1-h6)
    • Paragraphs
    • Links with resolved relative URLs
    • Images with resolved relative URLs
    • Ordered and unordered lists
    • Blockquotes
    • Code blocks
  • Real-time Progress: Live progress tracking with terminal output
  • Document Structure Preservation: Maintains semantic structure during conversion
  • Logging: Detailed logging with configurable levels
  • Error Handling: Exponential backoff and graceful failure recovery
  • Performance: Rust-powered HTML parsing with cached selectors
  • Configuration: Unified configuration management system
  • Connection Pooling: Efficient HTTP client with connection reuse

Installation

git clone https://github.com/ursisterbtw/markdown_lab.git
cd markdown_lab

# Quick setup with justfile (recommended)
just setup

# Or manual setup using UV (Python 3.12+ required)
uv sync
source .venv/bin/activate
maturin develop
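Once maturin develop finishes, a quick import check confirms the compiled extension is available. This is an optional sanity check, not part of the documented setup; see the module-naming note under Development.

# Run inside the activated virtualenv
from markdown_lab import markdown_lab_rs  # raises ImportError if the build failed
print("Rust bindings OK")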

Usage

Modern CLI Interface (Recommended)

The project features a modern CLI built with Typer and Rich for terminal output:

# Convert single URLs
mlab convert "https://example.com" --output article.md --format markdown
mlab convert "https://docs.example.com" --format json --chunks --chunk-size 1500

# Batch convert with progress bars
mlab batch links.txt --output batch_results --parallel --max-workers 8

# Convert via sitemap discovery
mlab sitemap "https://example.com" --min-priority 0.7 --limit 50

# Launch interactive Terminal User Interface
mlab-tui

# Show system status
mlab status

# Manage configuration
mlab config

Interactive Features

# Convert with live progress updates
mlab convert "https://example.com" --interactive --output article.md

# Batch processing with rich progress bars
mlab batch links.txt --output results --interactive --parallel

# Content chunking for RAG applications
mlab convert "https://docs.example.com" --chunks --chunk-size 1500 --chunk-overlap 200

Legacy CLI Interface

The original interface is still available for compatibility:

# Using legacy CLI directly
MARKDOWN_LAB_LEGACY=1 python -m markdown_lab "https://example.com" --output article.md
# or
mlab-legacy "https://example.com" --output article.md

# Convert to different formats
mlab-legacy "https://example.com" --output output.json --format json
mlab-legacy "https://example.com" --output output.xml --format xml

Advanced Usage Examples

# Comprehensive sitemap scraping with modern CLI
mlab sitemap "https://example.com" \
    --min-priority 0.5 \
    --include "blog/*" "products/*" \
    --exclude "*.pdf" "temp/*" \
    --limit 50 \
    --chunks \
    --chunk-dir my_chunks \
    --rate-limit 2.0

# Parallel batch processing with progress tracking
mlab batch urls.txt \
    --output results \
    --parallel \
    --max-workers 8 \
    --format json \
    --interactive

Modern CLI Commands

  • mlab convert <url>: Convert a single URL, e.g. mlab convert "https://example.com" --output article.md
  • mlab batch <file>: Batch convert URLs from a file, e.g. mlab batch links.txt --parallel --max-workers 8
  • mlab sitemap <url>: Convert via sitemap discovery, e.g. mlab sitemap "https://example.com" --limit 50
  • mlab-tui: Launch the Terminal User Interface
  • mlab status: Show system status
  • mlab config: Manage configuration

Command Options

  • --output, -o: Output file or directory (default: auto-generated)
  • --format, -f: Output format, one of markdown, json, xml (default: markdown)
  • --chunks: Enable content chunking for RAG (default: False)
  • --chunk-size: Maximum chunk size in characters (default: 1500)
  • --chunk-overlap: Overlap between chunks in characters (default: 200)
  • --chunk-dir: Directory to save chunks (default: chunks)
  • --interactive, -i: Enable interactive progress display (default: False)
  • --parallel: Use parallel processing (default: False)
  • --max-workers: Maximum parallel workers (default: 4)
  • --rate-limit: Rate limit for requests in requests/sec (default: 2.0)
  • --min-priority: Minimum sitemap URL priority (default: 0.5)
  • --include: URL patterns to include (default: none)
  • --exclude: URL patterns to exclude (default: none)
  • --limit: Maximum URLs to process (default: no limit)

As a Module

Basic Scraping and Conversion

from markdown_lab.core.scraper import MarkdownScraper
from markdown_lab.core.config import MarkdownLabConfig

# Using centralized configuration
config = MarkdownLabConfig(
    requests_per_second=2.0,
    timeout=30,
    cache_enabled=True
)

# Using default Markdown format
scraper = MarkdownScraper(config)
html_content = scraper.scrape_website("https://example.com")
markdown_content = scraper.convert_to_markdown(html_content, "https://example.com")
scraper.save_content(markdown_content, "output.md")

# Using JSON or XML format with the Rust implementation
from markdown_lab import markdown_lab_rs

html_content = scraper.scrape_website("https://example.com")

# Convert to Markdown (legacy helper)
markdown_content = markdown_lab_rs.convert_html_to_markdown(
    html_content, "https://example.com"
)
scraper.save_content(markdown_content, "output.md")

# Convert to JSON or XML using string format names
json_content = markdown_lab_rs.convert_html_to_format(
    html_content, "https://example.com", "json"
)
scraper.save_content(json_content, "output.json")

xml_content = markdown_lab_rs.convert_html_to_format(
    html_content, "https://example.com", "xml"
)
scraper.save_content(xml_content, "output.xml")

# Note: an OutputFormat enum is also exposed for convenience:
#   fmt = markdown_lab_rs.OutputFormat.from_str("json")  # returns an enum value
# The current Python bindings accept the string names "markdown", "json", and "xml".

With Sitemap Discovery

from markdown_lab.core.scraper import MarkdownScraper

scraper = MarkdownScraper(requests_per_second=2.0)
# Scrape using sitemap discovery
scraped_urls = scraper.scrape_by_sitemap(
    base_url="https://example.com",
    output_dir="output_dir",
    min_priority=0.5,                  # Only URLs with priority >= 0.5
    include_patterns=["blog/*"],       # Only blog URLs
    exclude_patterns=["temp/*"],       # Exclude temporary pages
    limit=20,                          # Maximum 20 URLs
    save_chunks=True,                  # Enable chunking
    chunk_dir="my_chunks",             # Save chunks here
    chunk_format="jsonl"               # Use JSONL format
)
print(f"Successfully scraped {len(scraped_urls)} URLs")

Using Links File

from markdown_lab.core.scraper import MarkdownScraper

scraper = MarkdownScraper(requests_per_second=2.0)
# Scrape URLs from a links file
scraper.scrape_by_links_file(
    links_file="links.txt",        # File containing URLs to scrape
    output_dir="output_dir",       # Directory to save output files
    save_chunks=True,              # Enable chunking
    output_format="markdown",      # Output format (markdown, json, xml)
    parallel=True,                 # Enable parallel processing
    max_workers=8                  # Use 8 parallel workers
)
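The links file is plain text with one URL per line (an assumption about the format); a plausible links.txt:

https://example.com/docs/intro
https://example.com/docs/quickstart
https://example.com/blog/release-notes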

Direct Sitemap Access

from markdown_lab.utils.sitemap_utils import SitemapParser, discover_site_urls

# Quick discovery of URLs from sitemap
urls = discover_site_urls(
    base_url="https://example.com",
    min_priority=0.7,
    include_patterns=["products/*"],
    limit=10
)

# Or with more control
parser = SitemapParser()
parser.parse_sitemap("https://example.com")
urls = parser.filter_urls(min_priority=0.5)
parser.export_urls_to_file(urls, "sitemap_urls.txt")

Sitemap Integration Features

The library discovers and parses XML sitemaps to scrape exactly what you need:

  • Automatic Discovery: Finds sitemaps through robots.txt or common locations
  • Sitemap Index Support: Handles multi-level sitemap index files
  • Priority-Based Filtering: Choose URLs based on their priority in the sitemap
  • Pattern Matching: Include or exclude URLs with regex patterns
  • Optimized Scraping: Only scrape the pages that matter most
  • Structured Organization: Creates meaningful filenames based on URL paths

RAG Chunking Capabilities

The library implements chunking designed specifically for RAG (Retrieval-Augmented Generation) systems; a sketch of loading the resulting chunks follows the list below:

  • Semantic Chunking: Preserves the semantic structure of documents by chunking based on headers
  • Content-aware: Large sections are split into overlapping chunks for better context preservation
  • Metadata-rich: Each chunk contains detailed metadata for better retrieval
  • Multiple Formats: Save chunks as individual JSON files or as a single JSONL file
  • Customizable: Control chunk size and overlap to balance between precision and context
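As referenced above, here is a minimal sketch of consuming saved chunks in a RAG pipeline, assuming chunks were written as JSONL (one JSON object per line) via chunk_format="jsonl". The field names "content" and "metadata" are assumptions about the chunk schema.

import json
from pathlib import Path

def load_chunks(chunk_dir: str) -> list[dict]:
    """Load every JSONL chunk file under chunk_dir (schema assumed)."""
    chunks = []
    for path in Path(chunk_dir).glob("*.jsonl"):
        with path.open() as f:
            chunks.extend(json.loads(line) for line in f if line.strip())
    return chunks

# Feed chunks to a retriever or embedder of your choice
for chunk in load_chunks("my_chunks")[:3]:
    print(chunk.get("content", "")[:80], chunk.get("metadata"))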

Testing

Using Justfile (Recommended)

# Run all tests (Rust + Python + integration)
just test

# Run specific test suites
just test-python          # Python tests only
just test-rust            # Rust tests only
just test-bindings        # Python binding tests
just test-integration     # Integration tests
just test-coverage        # Tests with coverage reporting

# Development workflow
just dev-cycle            # Quick build + test cycle
just full-cycle           # Complete build + lint + test

Raw Commands

# Python unit tests (exclude integration tests)
pytest -m "not integration"

# Run integration tests (Rust-backed path)
pytest -m integration

# Rust tests
cargo test
RUST_LOG=debug cargo test -- --nocapture

# Python binding tests
pytest tests/rust/test_python_bindings.py -v

# Unit tests
pytest tests/unit/

Running Benchmarks

# Using justfile
just bench                # All benchmarks
just bench-html          # HTML parsing benchmark
just bench-chunk         # Chunking benchmark
just bench-viz           # Visualize results

# Raw commands
cargo bench
cargo bench html_to_markdown
cargo bench chunk_markdown

# Visualize results
python scripts/visualize_benchmarks.py

Development

Justfile Commands

The project uses justfile for development workflows. Run just to see all commands:

# Setup and environment
just setup               # Complete development setup
just status              # Check project status
just clean               # Clean build artifacts
just update              # Update dependencies

# Building
just build-dev           # Development build
just build-release       # Optimized build
just build-js            # Build with JavaScript support

# Development workflows
just dev                 # Quick development mode
just dev-cycle           # Build + test bindings
just full-cycle          # Build + lint + test
just fix                 # Fix common issues

# Code quality
just lint                # Run all linting
just lint-python         # Python linting only
just lint-rust           # Rust linting only
just typecheck           # Type checking

# Demos and examples
just demo                # Format conversion demo
just hello               # Hello world example
just cli-test            # Test CLI functionality

Code Organization

  • markdown_lab/: Main Python package

    • __init__.py: Package initialization
    • __main__.py: Command-line entry point
    • core/: Core functionality
      • scraper.py: Main scraper implementation
      • cache.py: Request caching
      • throttle.py: Rate limiting for web requests
    • utils/: Utility modules
      • chunk_utils.py: Utilities for chunking text for RAG
      • sitemap_utils.py: Sitemap parsing and URL discovery
      • version.py: Version information
    • markdown_lab_rs.py: Python interface to Rust components
  • src/: Rust source code

    • lib.rs: Main library and Python bindings
    • html_parser.rs: HTML parsing utilities
    • markdown_converter.rs: HTML to Markdown conversion
    • chunker.rs: Markdown chunking logic
    • js_renderer.rs: JavaScript page rendering
  • tests/: Test files

    • unit/: Python unit tests
    • integration/: Integration tests
    • rust/: Rust and Python binding tests
  • benches/: Benchmark files

    • Performance tests for core operations
  • examples/: Example scripts and demos

    • demo_formats.py: Demo of different output formats
    • hello.py: Simple hello world example
  • docs/: Documentation

    • Various documentation files and guides
    • assets/: Documentation assets like images

Running with Real JavaScript Rendering

To enable real JavaScript rendering with headless Chrome:

cargo build --release --features real_rendering

See docs/JS_RENDERING.md for more details.

Rust-Python module naming

The PyO3 extension is built and imported as markdown_lab.markdown_lab_rs (namespaced module).

  • In Python, import via from markdown_lab import markdown_lab_rs (see the snippet after this list).
  • The internal wrapper markdown_lab/markdown_lab_rs.py calls functions on this module and provides Python fallbacks.
  • Do not import a top-level markdown_lab_rs package; it will fail to resolve.
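The import rules above in code form (convert_html_to_markdown is the binding used in the module examples earlier):

# Correct: the extension is namespaced under the package
from markdown_lab import markdown_lab_rs
md = markdown_lab_rs.convert_html_to_markdown("<h1>Hi</h1>", "https://example.com")

# Incorrect: there is no top-level module by this name
# import markdown_lab_rs  # ModuleNotFoundError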

Performance Considerations

  • HTML to Markdown conversion is optimized for medium to large documents with cached selectors
  • Chunking algorithm balances semantic coherence with performance
  • JavaScript rendering can be CPU and memory intensive
  • Unified HTTP client provides connection pooling and efficient request handling (see the reuse sketch below)
  • Centralized configuration management reduces overhead and improves consistency
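As noted in the list above, constructing one scraper and reusing it across URLs lets the pooled HTTP client recycle connections. A minimal sketch built on the configuration API shown earlier (the URLs and output filenames are illustrative):

from markdown_lab.core.scraper import MarkdownScraper
from markdown_lab.core.config import MarkdownLabConfig

config = MarkdownLabConfig(requests_per_second=2.0, timeout=30, cache_enabled=True)
scraper = MarkdownScraper(config)  # one instance, one pooled HTTP client

# Reuse the same scraper so connections (and the cache) are shared
for url in ["https://example.com/a", "https://example.com/b"]:
    html = scraper.scrape_website(url)
    markdown = scraper.convert_to_markdown(html, url)
    scraper.save_content(markdown, f"{url.rsplit('/', 1)[-1]}.md")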

Dependencies

Core Dependencies

  • Python 3.12+: Minimum required Python version
  • requests: Web scraping and HTTP requests
  • beautifulsoup4: HTML parsing fallback
  • typer: Modern CLI framework with rich terminal output
  • rich: Terminal formatting and progress bars
  • textual: Terminal User Interface framework

Development Dependencies

  • pytest: Testing framework with benchmarking support
  • mypy: Type checking with strict configuration
  • ruff: Fast Python linter and formatter
  • maturin: Rust-Python integration
  • uv: Fast Python package manager

Rust Dependencies

  • pyo3: Python bindings
  • scraper: High-performance HTML parsing with cached selectors
  • serde: Serialization for JSON/XML output
  • once_cell: Cached selector compilation

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/my-feature)
  3. Commit your changes (git commit -m 'Add my feature')
  4. Push to the branch (git push origin feature/my-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Roadmap

✓ Completed

  • Add support for more HTML elements
  • Implement chunking for RAG
  • Add sitemap.xml parsing for systematic scraping
  • Add JSON and XML output formats
  • Optimize HTML parsing with cached selectors (40-50% performance improvement)
  • Centralized configuration management
  • Unified error hierarchy with structured exceptions
  • Unified HTTP client with connection pooling
  • Remove dead dependencies and fix version conflicts

🚧 In Progress

  • Async HTTP operations for parallel processing
  • Memory usage optimization in chunking algorithms
  • Module restructuring for better maintainability

Planned

  • Add support for JavaScript-rendered pages
  • Implement custom markdown templates
  • Include CSS selector support
  • Enhanced caching with LRU eviction
  • Token bucket rate limiting

Author

ursister


Creating an Official Release

To create an official release, follow these steps:

  1. Update Version Numbers: bump the version in Cargo.toml, pyproject.toml, and markdown_lab/__init__.py to the new release version.
  2. Commit Changes: commit the version bumps along with any other updates.
  3. Tag the Release:

     git tag -a v1.0.0 -m "Release version 1.0.0"
     git push origin v1.0.0

  4. Push Changes: push to the main branch:

     git push origin main

  5. Create GitHub Release: on the repository, create a new release from the tag made in step 3.
  6. Verify Release: confirm the release was created successfully and that all components work as expected.
