Merged
43 changes: 43 additions & 0 deletions .github/workflows/publish-to-pypi.yaml
@@ -0,0 +1,43 @@
# GitHub Actions workflow to build and publish to PyPI
name: Build and Publish Python 🐍📦

on:
  release:
    types:
      - published

jobs:
  build-and-publish:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install build dependencies
        run: |
          python -m pip install --upgrade pip
          pip install build twine

      - name: Build package
        run: |
          python -m build

      - name: Publish to PyPI
        env:
          TWINE_USERNAME: __token__
          TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
        run: |
          twine upload dist/*

      - name: Publish to TestPyPI (optional)
        if: github.ref_type == 'branch' && github.ref_name == 'main'
        env:
          TWINE_USERNAME: __token__
          TWINE_PASSWORD: ${{ secrets.TEST_PYPI_API_TOKEN }}
        run: |
          twine upload --repository testpypi dist/*
3 changes: 3 additions & 0 deletions .gitignore
@@ -9,3 +9,6 @@ __pycache__
.cursorignore
/test.py
.aider*
*.egg-info
.pytest_cache
.ruff_cache
93 changes: 59 additions & 34 deletions Dockerfile
@@ -1,59 +1,84 @@
# Use Dockerfile syntax version 1.5 for compatibility and new features
# syntax=docker/dockerfile:1.5
# syntax=docker/dockerfile:1

FROM python:3.13 AS builder
# ==============================================================================
# Base Stage: Installs uv and creates a non-root user for security
# ==============================================================================
FROM python:3.13-slim AS base

# Set non-interactive mode
ENV DEBIAN_FRONTEND=noninteractive
ENV UV_COMPILE_BYTECODE=1
ENV UV_LINK_MODE=copy

# Prevent docker from cleaning up the apt cache
RUN rm -f /etc/apt/apt.conf.d/docker-clean
# Install uv, the modern Python package manager
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Define ARG for platform-specific cache separation
# Create a non-root user and group for enhanced security
RUN groupadd --system --gid 1001 app && \
useradd --system --uid 1001 --gid 1001 -m app

# ==============================================================================
# Builder Stage: Install system and Python dependencies with optimized caching
# ==============================================================================
FROM base AS builder

# Argument for multi-platform builds
ARG TARGETPLATFORM

# Update and install dependencies with cache separated by architecture
# CORRECTION: Argument to receive the application version from the host
ARG APP_VERSION=0.0.0

# Install build-time system dependencies using BuildKit cache mounts
RUN --mount=type=cache,target=/var/cache/apt,id=apt-cache-${TARGETPLATFORM} \
--mount=type=cache,target=/var/lib/apt,id=apt-lib-${TARGETPLATFORM} \
apt-get update && \
apt-get install --no-install-recommends -y libxml2-dev libxslt-dev
apt-get install -y --no-install-recommends \
libxml2-dev \
libxslt-dev

WORKDIR /app

COPY requirements.txt .
# Grant ownership to the non-root user before using it
RUN --mount=type=cache,target=/home/app/.cache/uv,uid=1001,gid=1001 \
chown -R app:app /app /home/app/.cache/uv

# Use pip cache to speed up builds
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements.txt -t packages
USER app

# Start from a slim Python 3.12 image for a small final image size
FROM python:3.13-slim AS final
# Copy source code BEFORE installing dependencies, as setuptools-scm needs it
COPY --chown=app:app . .

# Set non-interactive mode
ENV DEBIAN_FRONTEND=noninteractive
# CORRECTION: Pass the version to setuptools-scm via an environment variable
# This tells setuptools-scm to use this version string instead of looking for .git
RUN --mount=type=cache,target=/home/app/.cache/uv,uid=1001,gid=1001 \
SETUPTOOLS_SCM_PRETEND_VERSION=${APP_VERSION} \
uv sync

# Prevent docker from cleaning up the apt cache in the final image
RUN rm -f /etc/apt/apt.conf.d/docker-clean
# ==============================================================================
# Final Stage: Assemble the lean production image
# ==============================================================================
FROM base AS final

ARG TARGETPLATFORM
WORKDIR /app

# Copy built packages from the previous stage
COPY --from=builder /app/packages /app/packages
# Activate the virtual environment by adding it to the PATH
ENV PATH="/app/.venv/bin:$PATH"

# Update and install runtime dependencies if necessary, with cache separated by architecture
# Install only essential runtime system dependencies
ARG TARGETPLATFORM
RUN --mount=type=cache,target=/var/cache/apt,id=apt-cache-${TARGETPLATFORM} \
--mount=type=cache,target=/var/lib/apt,id=apt-lib-${TARGETPLATFORM} \
apt-get update && \
apt-get full-upgrade -y && \
apt-get install -y --no-install-recommends libxml2 libxslt1.1 libtk8.6
apt-get install -y --no-install-recommends \
libxml2 \
libxslt1.1 \
libtk8.6

WORKDIR /app
# Copy the virtual environment and source code from previous stages
# The source code is already in the builder stage, no need for another COPY . .
COPY --from=builder --chown=app:app /app /app

ENV PYTHONPATH=/app/packages:$PYTHONPATH
# This must be done as root BEFORE switching to the non-root user
RUN mkdir -p /home/app/.cache/crawler-to-md && chown -R app:app /home/app/.cache/crawler-to-md

# Copy the rest of the application's source code into the working directory
COPY . .
# Switch to the non-root user for execution
USER app

VOLUME [ "/app/cache"]
VOLUME [ "/home/app/.cache/crawler-to-md" ]

ENTRYPOINT [ "python", "main.py" ]
ENTRYPOINT [ "/app/.venv/bin/crawler-to-md" ]
41 changes: 25 additions & 16 deletions README.md
@@ -1,26 +1,36 @@
# Web Scraper to Markdown 🌐✍️

This Python-based web scraper fetches content from URLs and exports it into Markdown and JSON formats, specifically designed for simplicity, extensibility, and for uploading JSON files to GPT models. Ideal for those looking to leverage web content for AI training or analysis. 🤖💡
This Python-based web scraper fetches content from URLs and exports it into Markdown and JSON formats, specifically designed for simplicity, extensibility, and for uploading JSON files to GPT models. It is ideal for those looking to leverage web content for AI training or analysis. 🤖💡

## 🚀 Quick Start

(Or even better, **[use Docker!](#-docker-support) 🐳**)

### Recommended installation using pipx (isolated environment)

```shell
pipx install crawler-to-md
```

### Alternatively, install with pip

```shell
git clone https://github.com/obeone/crawler-to-md.git
cd crawler-to-md
pip install -r requirements.txt
pip install crawler-to-md
```

Then run the scraper:

python main.py --url https://www.example.com
```shell
crawler-to-md --url https://www.example.com
```

## 🌟 Features

- Scrapes web pages for content and metadata. 📄
- Filters links by base URL. 🔍
- Excludes URLs containing certain strings. ❌
- Automatically find links or can use a file of URLs to scrape. 🔗
- Rate limiting and delay 🕘
- Automatically finds links or can use a file of URLs to scrape. 🔗
- Rate limiting and delay support. 🕘
- Exports data to Markdown and JSON, ready for GPT uploads. 📤
- Exports each page as an individual Markdown file if `--export-individual` is used. 📝
- Uses SQLite for efficient data management. 📊
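
The link-filtering features above (keep links under a base URL, drop URLs containing excluded strings) can be illustrated with a minimal Python sketch. The `filter_links` helper and its parameters are hypothetical names chosen for this example; the project's actual implementation may differ.

```python
from urllib.parse import urldefrag, urljoin


def filter_links(page_url, hrefs, base_url, exclude=()):
    """Return absolute links under base_url that contain no excluded substring.

    Hypothetical helper for illustration only; not the project's API.
    """
    kept = []
    for href in hrefs:
        # Resolve relative links against the page and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if not absolute.startswith(base_url):
            continue  # outside the base URL
        if any(token in absolute for token in exclude):
            continue  # matches an excluded string
        if absolute not in kept:
            kept.append(absolute)  # de-duplicate while keeping order
    return kept


links = filter_links(
    "https://www.example.com/docs/",
    ["intro.html", "/docs/api#top", "https://other.example.org/", "/docs/private/x"],
    base_url="https://www.example.com/docs/",
    exclude=["private"],
)
```

Here `links` keeps only the two pages under `/docs/` that do not contain `private`.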
@@ -29,21 +39,20 @@ python main.py --url https://www.example.com

## 📋 Requirements

Python 3.12 and the following packages:
Python 3.9 or higher is required.

- `requests`
- `beautifulsoup4`
- `trafilatura`
- `coloredlogs`
Project dependencies are managed with `pyproject.toml`. Install them with:

Install with `pip install -r requirements.txt`.
```shell
pip install .
```

## 🛠 Usage

Start scraping with the following command:

```shell
python main.py --url <URL> [--output-folder ./output] [--cache-folder ./cache] [--base-url <BASE_URL>] [--exclude <KEYWORD_IN_URL>] [--title <TITLE>] [--urls-file <URLS_FILE>]
crawler-to-md --url <URL> [--output-folder ./output] [--cache-folder ./cache] [--base-url <BASE_URL>] [--exclude <KEYWORD_IN_URL>] [--title <TITLE>] [--urls-file <URLS_FILE>]
```

Options:
@@ -59,11 +68,11 @@ Options:
- `--rate-limit`, `-rl`: Maximum number of requests per minute (default: 0, no rate limit). ⏱️
- `--delay`, `-d`: Delay between requests in seconds (default: 0, no delay). 🕒
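
As a rough illustration of how `--rate-limit` (requests per minute) and `--delay` (seconds) could translate into a wait between requests, here is a minimal sketch. Taking the larger of the two intervals is an assumption made for this example, and `min_interval` and `Throttle` are hypothetical names, not the tool's actual code.

```python
import time


def min_interval(rate_limit, delay):
    """Seconds to wait between two requests.

    rate_limit: requests per minute (0 means no rate limit).
    delay: fixed delay in seconds (0 means no delay).
    Assumption: when both are set, the stricter (larger) interval applies.
    """
    per_request = 60.0 / rate_limit if rate_limit > 0 else 0.0
    return max(per_request, delay)


class Throttle:
    """Sleeps just long enough to respect the computed interval."""

    def __init__(self, rate_limit=0, delay=0.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = min_interval(rate_limit, delay)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `Throttle(rate_limit=30).wait()` before each request would space requests at least two seconds apart under these assumptions.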

One of the `--url` or `--urls-file` is required.
One of the `--url` or `--urls-file` options is required.
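
The "one of `--url` or `--urls-file`" rule can be sketched with `argparse` as below. This is only an illustration of the constraint; `parse_args` is a hypothetical wrapper, only the two options named above are declared, and the real CLI may enforce the rule differently.

```python
import argparse


def parse_args(argv=None):
    """Parse arguments and require at least one of --url / --urls-file.

    Hypothetical sketch; the real parser defines more options.
    """
    parser = argparse.ArgumentParser(prog="crawler-to-md")
    parser.add_argument("--url", help="Base URL to start scraping from")
    parser.add_argument("--urls-file", help="File containing URLs to scrape")
    args = parser.parse_args(argv)
    if not (args.url or args.urls_file):
        # parser.error prints the usage message and exits with status 2.
        parser.error("one of --url or --urls-file is required")
    return args


args = parse_args(["--url", "https://www.example.com"])
```

A manual check is used here instead of a mutually exclusive group because the README only says that one of the options is required, not that they cannot be combined.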

### 📚 Log level

By default, `WARN` level is used. You can change it with the `LOG_LEVEL` environment variable.
By default, the `WARN` level is used. You can change it with the `LOG_LEVEL` environment variable.
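
A minimal sketch of how a `LOG_LEVEL` environment variable with a `WARN` default might be resolved using the standard `logging` module. The `resolve_log_level` function is a hypothetical name, and falling back to `WARNING` for unknown values is an assumption of this example.

```python
import logging
import os


def resolve_log_level(environ=os.environ, default="WARN"):
    """Map LOG_LEVEL (case-insensitive) to a numeric logging level.

    Hypothetical helper; unknown names fall back to WARNING (an assumption).
    """
    name = environ.get("LOG_LEVEL", default).strip().upper()
    level = logging.getLevelName(name)
    # getLevelName returns a string such as "Level FOO" for unknown names.
    return level if isinstance(level, int) else logging.WARNING


logging.basicConfig(level=resolve_log_level())
```

With this sketch, running the tool as `LOG_LEVEL=DEBUG crawler-to-md --url <URL>` would select the `DEBUG` level.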

## 🐳 Docker Support

File renamed without changes.