A local caching frontend for browsing WARC archives, with SQLite database support for efficient CDX indexing and content caching. Defaults to browsing the ffffound.com archive.
- SQLite database for storing CDX entries, WARC file metadata, and cached content
- Progress bars for file downloads with resume capability
- Efficient content extraction from WARC files
- Web interface for viewing and searching archived content
- Caching of extracted content for faster subsequent access
- Default support for ffffound.com archive
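To illustrate the CDX indexing feature listed above, here is a minimal sketch of loading CDX lines into SQLite and looking up captures by URL. It assumes the common 11-column CDX layout (`CDX N b a m s k r M S V g`). The table name `cdx`, its columns, and the helpers `parse_cdx_line`, `build_index`, and `lookup` are hypothetical and are not taken from this project's `db.py` or `cdx.py`.

```python
import sqlite3

# Field order of the common 11-column CDX layout ("CDX N b a m s k r M S V g").
CDX_FIELDS = ("urlkey", "timestamp", "url", "mime", "status", "digest",
              "redirect", "meta", "length", "offset", "filename")


def parse_cdx_line(line):
    """Split one CDX line into a dict; return None for header or malformed lines."""
    parts = line.split()
    if line.lstrip().startswith("CDX") or len(parts) != len(CDX_FIELDS):
        return None
    return dict(zip(CDX_FIELDS, parts))


def _to_int(value):
    """CDX uses '-' for missing values, so only convert plain digit strings."""
    return int(value) if value.isdigit() else None


def build_index(conn, lines):
    """Create the (hypothetical) cdx table and insert every parsable line."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cdx ("
        "url TEXT, timestamp TEXT, mime TEXT, status TEXT, "
        "warc_offset INTEGER, warc_length INTEGER, filename TEXT)"
    )
    # An index on url keeps lookups fast once the table holds many captures.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_cdx_url ON cdx (url)")
    entries = [e for e in map(parse_cdx_line, lines) if e is not None]
    conn.executemany(
        "INSERT INTO cdx VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(e["url"], e["timestamp"], e["mime"], e["status"],
          _to_int(e["offset"]), _to_int(e["length"]), e["filename"])
         for e in entries],
    )
    conn.commit()
    return len(entries)


def lookup(conn, url):
    """Return (timestamp, filename, offset, length) rows for a URL, oldest first."""
    return conn.execute(
        "SELECT timestamp, filename, warc_offset, warc_length FROM cdx "
        "WHERE url = ? ORDER BY timestamp",
        (url,),
    ).fetchall()


# Made-up sample data: one header line and one capture record.
SAMPLE = [
    " CDX N b a m s k r M S V g",
    "com,example)/ 20170501120000 http://example.com/ text/html 200 "
    "ABCDEF - - 1234 5678 example-00000.warc.gz",
]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(build_index(conn, SAMPLE), "entries indexed")
    print(lookup(conn, "http://example.com/"))
```

The offset and length stored per capture are what allow a single record to be read from a WARC file without scanning the whole file, which is why they are kept alongside the filename.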
- Clone the repository:
git clone https://github.com/yourusername/warc-viewer.git
cd warc-viewer
- Create a virtual environment and activate it:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install for development:
pip install -e ".[dev]" # Install package in development mode with test dependencies
Or for regular use:
pip install . # Install package normally
Run the application with default settings (ffffound.com archive):
warc-viewer # Console script installed with the package
# OR
python -m warc_viewer # Equivalent module invocation
Or specify custom CDX and WARC URLs:
warc-viewer \
--cdx-url https://example.com/archive.cdx.gz \
--warc-base-url https://example.com/warcs/ \
--cache-dir ~/.cache/warc-viewer \
--host 127.0.0.1 \
--port 5000
- --cdx-url: URL to the CDX file (default: ffffound.com CDX)
- --warc-base-url: Base URL for WARC files (default: ffffound.com WARC base URL)
- --cache-dir: Local cache directory (default: ~/.cache/warc-viewer)
- --host: Host to bind to (default: 127.0.0.1)
- --port: Port to listen on (default: 5000)
- --debug: Enable debug mode
- --log-level: Set logging level (default: INFO)
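As a rough illustration of how the options above fit together, the following sketch declares them with `argparse` from the standard library. This is an assumption for illustration only: the project may well use a different argument-parsing library, and `build_parser` is a hypothetical name. The actual ffffound.com default URLs are not given here, so `None` stands in for them.

```python
import argparse
import os


def build_parser():
    """Build a parser mirroring the documented options (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="warc-viewer")
    # Placeholder defaults: the real ffffound.com URLs are not stated in this README.
    parser.add_argument("--cdx-url", default=None,
                        help="URL to the CDX file")
    parser.add_argument("--warc-base-url", default=None,
                        help="Base URL for WARC files")
    parser.add_argument("--cache-dir",
                        default=os.path.expanduser("~/.cache/warc-viewer"),
                        help="Local cache directory")
    parser.add_argument("--host", default="127.0.0.1",
                        help="Host to bind to")
    parser.add_argument("--port", type=int, default=5000,
                        help="Port to listen on")
    parser.add_argument("--debug", action="store_true",
                        help="Enable debug mode")
    parser.add_argument("--log-level", default="INFO",
                        help="Set logging level")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Note that `argparse` converts dashes in option names to underscores, so `--cdx-url` is read back as `args.cdx_url` and `--log-level` as `args.log_level`.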
Make sure you've installed the package with development dependencies:
pip install -e ".[dev]" # If not already done
pytest tests/ # Run tests
pytest --cov=warc_viewer tests/ # Run tests with coverage
src/warc_viewer/
├── __init__.py
├── __main__.py # Entry point
├── app.py # Flask application
├── db.py # Database operations
├── cdx.py # CDX file handling
├── warc.py # WARC file operations
└── downloader.py # File download utilities
tests/
├── __init__.py
├── test_db.py
├── test_cdx.py
└── test_warc.py
MIT License