
CRAPP - Content RAG Preparation Pipeline

A modular pipeline for scraping, parsing, and processing content into RAG-ready JSON artifacts.

Primary Interface: Web API at localhost:9090

Features

  • Web API with HTML UI for interactive processing
  • URL scraping (HTML snapshots, main content extraction, PDF detection)
  • PDF parsing (via pdfplumber)
  • Local caching (cache/raw for raw HTML/PDF text)
  • Sliding window processing via SecureChatAI (REDCap proxy)
  • Deduplication & canonical JSON output (cache/rag_ready)
  • GCS storage integration (optional)
  • Modular design (rag_pipeline/ package with submodules)

Pipeline Flow

flowchart LR
    A[URLs] --> B[Scraper/PDF Parser] --> C[cache/raw]
    C --> D[Sliding Window] --> E[cache/rag_ready JSON]
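A minimal, self-contained sketch of this flow in Python (every function here is an illustrative stub, not the actual rag_pipeline API):

import json
from pathlib import Path

RAW_DIR = Path("cache/raw")
READY_DIR = Path("cache/rag_ready")

def scrape(url):
    # Stage 1 stub: a real implementation fetches HTML or parses a PDF
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    path = RAW_DIR / (url.replace("://", "_").replace("/", "_") + ".txt")
    path.write_text(f"raw text for {url}")
    return path

def sliding_window(path, size=500):
    # Stage 2 stub: split cached raw text into fixed-size chunks
    text = path.read_text()
    return [text[i:i + size] for i in range(0, len(text), size)]

def write_output(docs, run_id):
    # Stage 3 stub: dedupe (omitted) and write one canonical JSON per run
    READY_DIR.mkdir(parents=True, exist_ok=True)
    out = READY_DIR / f"{run_id}.json"
    out.write_text(json.dumps(docs, indent=2))
    return out

docs = {u: sliding_window(scrape(u)) for u in ["https://example.com"]}
print(write_output(docs, "demo_run"))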

Quick Start

  1. Clone the repo
  2. Create .env with your credentials:
    REDCAP_API_URL=https://your-redcap-instance/api/
    REDCAP_API_TOKEN=your_token_here
    GCS_BUCKET=your-bucket-name  # optional
  3. Build and run:
    docker-compose build
    docker-compose up
  4. Open http://localhost:9090 in your browser

Usage

Web API (Primary)

Start the server and visit http://localhost:9090:

docker-compose up

The web UI allows you to:

  • Enter URLs to process
  • Configure prompts
  • Toggle link-following
  • Upload documents (PDF, DOCX, TXT)

API Endpoints

POST /run - Process URLs

curl -X POST http://localhost:9090/run \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "follow_links": true}'

Response:

{
  "status": "completed",
  "run_id": "crapp_2026-01-06T18-30-00Z_a1b2c3d4",
  "output_path": "cache/rag_ready/crapp_2026-01-06T18-30-00Z_a1b2c3d4.json",
  "stats": {"documents_processed": 1, "total_sections": 5, ...},
  "warnings": []
}
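For scripted use, the same call can be made from Python. A minimal sketch using the requests library (the endpoint and payload match the curl example above; the timeout value is an arbitrary choice):

import requests

resp = requests.post(
    "http://localhost:9090/run",
    json={"urls": ["https://example.com"], "follow_links": True},
    timeout=600,  # scraping runs can be slow; adjust as needed
)
resp.raise_for_status()
result = resp.json()
print(result["run_id"], result["output_path"])
for warning in result.get("warnings", []):
    print("warning:", warning)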

POST /upload - Upload and process a document
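A hypothetical upload sketch, assuming the endpoint accepts a multipart form field named "file" (the actual field name may differ; check the web UI):

import requests

with open("report.pdf", "rb") as fh:
    # "file" is an assumed field name, not confirmed by the API docs
    resp = requests.post("http://localhost:9090/upload", files={"file": fh})
resp.raise_for_status()
print(resp.json())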

GET /health - Health check

CLI

Run the pipeline from the command line:

docker-compose run --rm scraper python -m rag_pipeline.main
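main.py also exposes run_pipeline() for programmatic use. A hypothetical sketch of calling it directly (the real signature lives in rag_pipeline/main.py and may differ):

from rag_pipeline.main import run_pipeline

# Hypothetical call: argument names here are assumptions, not the documented API
run_pipeline(urls=["https://example.com"], follow_links=False)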

Output Format

CRAPP produces a single canonical JSON file per run at cache/rag_ready/{run_id}.json.

Schema version: crapp.v1

{
  "schema_version": "crapp.v1",
  "crapp_version": "0.2.0",
  "run": {
    "run_id": "crapp_2026-01-06T18-30-00Z_a1b2c3d4",
    "timestamp_start": "2026-01-06T18:30:00Z",
    "timestamp_end": "2026-01-06T18:32:15Z",
    "triggered_by": "web_api",
    "run_mode": "deterministic",
    "follow_links": true,
    "tags": []
  },
  "documents": [...],
  "aggregate_stats": {...},
  "warnings": [...]
}
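Downstream consumers can load the artifact and check the schema version before indexing. A minimal sketch (the per-document fields are elided above, so only the top-level keys shown are assumed):

import json
from pathlib import Path

path = Path("cache/rag_ready") / "crapp_2026-01-06T18-30-00Z_a1b2c3d4.json"
artifact = json.loads(path.read_text())

assert artifact["schema_version"] == "crapp.v1", "unexpected schema version"
print(artifact["run"]["run_id"], len(artifact["documents"]), "documents")
for warning in artifact["warnings"]:
    print("warning:", warning)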

Project Structure

.
├── cache/
│   ├── raw/           # raw scraped HTML/PDF text
│   └── rag_ready/     # canonical JSON output
├── config/
│   ├── urls.txt       # default URL list
│   └── sliding_window_prompts.json
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── README.md
├── CLAUDE.md
└── rag_pipeline/
    ├── web.py              # FastAPI web interface (primary)
    ├── main.py             # CLI entrypoint + run_pipeline()
    ├── cli.py              # Interactive CLI
    ├── output_json.py      # Canonical JSON writer
    ├── scraping/
    │   ├── scraper.py
    │   └── pdf_parser.py
    ├── processing/
    │   ├── ai_client.py    # SecureChatAI proxy
    │   └── sliding_window.py
    ├── storage/
    │   └── storage.py
    └── utils/
        └── logger.py

Environment Variables

Variable          Required  Description
REDCAP_API_URL    Yes       REDCap API endpoint for SecureChatAI
REDCAP_API_TOKEN  Yes       REDCap API token
GCS_BUCKET        No        GCS bucket for artifact upload
STORAGE_MODE      No        local (default) or gcs
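The table above maps directly onto a small settings loader. An illustrative sketch (not the pipeline's actual config code):

import os

def load_settings():
    settings = {
        # Required: fail fast with a KeyError if missing
        "redcap_api_url": os.environ["REDCAP_API_URL"],
        "redcap_api_token": os.environ["REDCAP_API_TOKEN"],
        # Optional, with the defaults described above
        "gcs_bucket": os.environ.get("GCS_BUCKET"),
        "storage_mode": os.environ.get("STORAGE_MODE", "local"),
    }
    # Assumption: gcs mode is only useful with a bucket configured
    if settings["storage_mode"] == "gcs" and not settings["gcs_bucket"]:
        raise ValueError("STORAGE_MODE=gcs requires GCS_BUCKET")
    return settings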

Run Modes

Mode           Description
deterministic  Pure text extraction, no AI calls (default)
ai_auto        AI triggered by noise-detection heuristics
ai_always      Every chunk passes through AI normalization
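The three modes differ only in when the AI normalization step fires. An illustrative dispatch sketch (the noise heuristic and AI call below are stand-ins, not the pipeline's actual implementation):

def looks_noisy(chunk):
    # Stand-in heuristic: mostly short lines suggest nav/boilerplate debris
    lines = chunk.splitlines() or [chunk]
    return sum(len(l) < 20 for l in lines) > len(lines) / 2

def ai_normalize(chunk):
    # Placeholder for the SecureChatAI call in processing/ai_client.py
    return chunk.strip()

def normalize_chunk(chunk, run_mode="deterministic"):
    if run_mode == "deterministic":
        return chunk                      # pure extraction, no AI calls
    if run_mode == "ai_always":
        return ai_normalize(chunk)        # every chunk passes through AI
    if run_mode == "ai_auto":
        return ai_normalize(chunk) if looks_noisy(chunk) else chunk
    raise ValueError(f"unknown run_mode: {run_mode}")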
