View cdx and warc files, caching them locally as needed

The-Focus-AI/warc_viewer

WARC Viewer

A local caching frontend for browsing WARC archives, with SQLite database support for efficient CDX indexing and content caching. Defaults to browsing the ffffound.com archive.
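As a rough illustration of what CDX indexing in SQLite can look like, here is a minimal sketch. It is not the project's actual `db.py`: the table name, column set, and helper functions are hypothetical, and it assumes the common 11-field CDX layout (`CDX N b a m s k r M S V g`), which the real archive may not use.

```python
import sqlite3

# Assumed 11-field CDX layout; check the header line of the real CDX file.
FIELDS = ("urlkey", "timestamp", "original", "mimetype", "status",
          "digest", "redirect", "metatags", "length", "offset", "filename")

# Hypothetical schema: one row per capture, indexed by original URL.
SCHEMA = """
CREATE TABLE IF NOT EXISTS cdx_entries (
    urlkey    TEXT NOT NULL,
    timestamp TEXT NOT NULL,
    original  TEXT NOT NULL,
    mimetype  TEXT,
    status    TEXT,
    length    INTEGER NOT NULL,
    offset    INTEGER NOT NULL,
    filename  TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_cdx_original ON cdx_entries (original);
"""


def parse_cdx_line(line):
    """Return a dict for one CDX data line, or None for header/malformed lines."""
    parts = line.strip().split(" ")
    if len(parts) != len(FIELDS):
        return None
    row = dict(zip(FIELDS, parts))
    try:
        row["length"] = int(row["length"])
        row["offset"] = int(row["offset"])
    except ValueError:
        return None
    return row


def index_lines(conn, lines):
    """Insert every parseable CDX line and return how many rows were stored."""
    rows = [r for r in map(parse_cdx_line, lines) if r is not None]
    conn.executemany(
        "INSERT INTO cdx_entries VALUES (:urlkey, :timestamp, :original, "
        ":mimetype, :status, :length, :offset, :filename)",
        rows,
    )
    conn.commit()
    return len(rows)


def lookup(conn, url):
    """Return (filename, offset, length) for the newest capture of url, or None."""
    cur = conn.execute(
        "SELECT filename, offset, length FROM cdx_entries "
        "WHERE original = ? ORDER BY timestamp DESC LIMIT 1",
        (url,),
    )
    return cur.fetchone()
```

With an index like this, serving a page only needs one indexed lookup to find which WARC file holds it and where the record starts.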

Features

  • SQLite database for storing CDX entries, WARC file metadata, and cached content
  • Resumable file downloads with progress bars
  • Efficient content extraction from WARC files
  • Web interface for viewing and searching archived content
  • Caching of extracted content for faster subsequent access
  • Default support for ffffound.com archive
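To illustrate why extraction from a `.warc.gz` file can be efficient, the sketch below reads a single record using the offset and compressed length that a CDX entry provides. It assumes each record is stored as its own gzip member, which is the usual convention for CDX-indexed archives. The function names are hypothetical and this is not necessarily how the project's `warc.py` works.

```python
import gzip


def read_record(warc_file, offset, length):
    """Read the gzip member at (offset, length); return (WARC headers, block)."""
    warc_file.seek(offset)
    raw = gzip.decompress(warc_file.read(length))
    head, _, rest = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the "WARC/1.0" version line
        name, _, value = line.partition(b":")
        headers[name.strip().lower().decode("ascii")] = value.strip().decode("utf-8")
    # Content-Length gives the block size, excluding the record's trailing CRLFs.
    block = rest[: int(headers["content-length"])]
    return headers, block


def split_http(block):
    """Split an HTTP response block into (header bytes, body bytes)."""
    http_head, _, body = block.partition(b"\r\n\r\n")
    return http_head, body
```

Because only `length` bytes are read and decompressed, the cost of fetching one page does not depend on the size of the WARC file that contains it.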

Installation

  1. Clone the repository:
git clone https://github.com/The-Focus-AI/warc_viewer.git
cd warc_viewer
  2. Create a virtual environment and activate it:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install for development:
pip install -e ".[dev]"  # Install package in development mode with test dependencies

Or for regular use:

pip install .  # Install package normally

Usage

Run the application with default settings (ffffound.com archive):

warc-viewer  # If installed normally
# OR
python -m warc_viewer  # If installed in development mode

Or specify custom CDX and WARC URLs:

warc-viewer \
  --cdx-url https://example.com/archive.cdx.gz \
  --warc-base-url https://example.com/warcs/ \
  --cache-dir ~/.cache/warc-viewer \
  --host 127.0.0.1 \
  --port 5000

Command Line Options

  • --cdx-url: URL to the CDX file (default: ffffound.com CDX)
  • --warc-base-url: Base URL for WARC files (default: ffffound.com WARC base URL)
  • --cache-dir: Local cache directory (default: ~/.cache/warc-viewer)
  • --host: Host to bind to (default: 127.0.0.1)
  • --port: Port to listen on (default: 5000)
  • --debug: Enable debug mode
  • --log-level: Set logging level (default: INFO)
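The options above map naturally onto an `argparse` parser. The sketch below shows one possible wiring and is not the project's actual `__main__.py`. The two URL defaults are placeholders, since this README does not spell out the real ffffound.com URLs, and `build_parser` is a hypothetical name.

```python
import argparse
from pathlib import Path

# Placeholder defaults: the real ffffound.com URLs are not given in this README.
DEFAULT_CDX_URL = "https://example.com/archive.cdx.gz"
DEFAULT_WARC_BASE_URL = "https://example.com/warcs/"


def build_parser():
    """Build a parser exposing the options listed under Command Line Options."""
    parser = argparse.ArgumentParser(prog="warc-viewer")
    parser.add_argument("--cdx-url", default=DEFAULT_CDX_URL,
                        help="URL to the CDX file")
    parser.add_argument("--warc-base-url", default=DEFAULT_WARC_BASE_URL,
                        help="Base URL for WARC files")
    parser.add_argument("--cache-dir",
                        type=lambda p: Path(p).expanduser(),
                        default=Path("~/.cache/warc-viewer").expanduser(),
                        help="Local cache directory")
    parser.add_argument("--host", default="127.0.0.1", help="Host to bind to")
    parser.add_argument("--port", type=int, default=5000,
                        help="Port to listen on")
    parser.add_argument("--debug", action="store_true", help="Enable debug mode")
    parser.add_argument("--log-level", default="INFO", help="Logging level")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```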

Development

Running Tests

Make sure you've installed the package with development dependencies:

pip install -e ".[dev]"  # If not already done
pytest tests/  # Run tests
pytest --cov=warc_viewer tests/  # Run tests with coverage

Project Structure

src/warc_viewer/
├── __init__.py
├── __main__.py     # Entry point
├── app.py          # Flask application
├── db.py           # Database operations
├── cdx.py          # CDX file handling
├── warc.py         # WARC file operations
└── downloader.py   # File download utilities

tests/
├── __init__.py
├── test_db.py
├── test_cdx.py
└── test_warc.py
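For readers curious how a module like `downloader.py` might resume an interrupted download, here is a minimal standard-library sketch based on the HTTP `Range` header. The function names are hypothetical, the progress bar is left out, and the project's real implementation may use a different HTTP client.

```python
import os
import urllib.request


def resume_request(url, dest):
    """Build a request for only the bytes not yet present in dest."""
    start = os.path.getsize(dest) if os.path.exists(dest) else 0
    req = urllib.request.Request(url)
    if start:
        req.add_header("Range", f"bytes={start}-")
    return req, start


def download(url, dest, chunk_size=64 * 1024):
    """Download url to dest, appending to a partial file when the server allows."""
    req, start = resume_request(url, dest)
    # Note: a server may answer 416 (raised as HTTPError) if dest is already complete.
    with urllib.request.urlopen(req) as resp:
        # 206 means the Range header was honoured; otherwise start from scratch.
        mode = "ab" if start and resp.status == 206 else "wb"
        with open(dest, mode) as out:
            while chunk := resp.read(chunk_size):
                out.write(chunk)
```

Falling back to `"wb"` when the server ignores the range request avoids appending a full copy of the file to an existing partial one.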

License

MIT License
