Webscraper

- Disallow: /fileadmin/: Many PDFs are skipped due to this rule in robots.txt. ➔ Options:
  - Skip the rule for PDFs only (currently done carefully); see the sketch below.
  - Or ask the university for explicit approval to scrape PDFs under /fileadmin/.
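A minimal sketch of the first option, assuming the crawler is a standard Scrapy spider: Scrapy's robots.txt middleware honors a `dont_obey_robotstxt` request meta flag, so PDF links under /fileadmin/ can be fetched explicitly while every other request keeps obeying robots.txt. Spider name, start URL, and selectors are placeholders.

```python
import scrapy


class ThwsPdfSpider(scrapy.Spider):
    """Sketch: obey robots.txt globally, but fetch PDFs under /fileadmin/ anyway."""
    name = "thws_pdf_sketch"                      # placeholder name
    start_urls = ["https://www.thws.de/"]         # placeholder start URL
    custom_settings = {"ROBOTSTXT_OBEY": True}

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            if url.lower().endswith(".pdf") and "/fileadmin/" in url:
                # Bypass the robots.txt check for this single request only.
                yield scrapy.Request(url, callback=self.save_pdf,
                                     meta={"dont_obey_robotstxt": True})
            else:
                yield scrapy.Request(url, callback=self.parse)

    def save_pdf(self, response):
        yield {"url": response.url, "body": response.body}
```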
- Scrapy DeltaFetch: Use scrapy-deltafetch to avoid refetching already scraped pages (settings sketch below). ➔ Benefits:
  - Faster, incremental crawls.
  - Only new or changed pages trigger updates (saves compute & storage).
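Enabling the plugin is essentially a settings change; a sketch of the relevant `settings.py` entries (the middleware priority and optional directory are choices, not requirements):

```python
# settings.py (sketch): enable scrapy-deltafetch for incremental crawls
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True          # skip requests whose items were already scraped
# DELTAFETCH_DIR = "deltafetch"    # optional: where the fingerprint DB is stored
# DELTAFETCH_RESET = True          # optional: force a one-off full recrawl
```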
- Handling database updates: ➔ Options for handling rescraped content:
  - Overwrite existing entries in Postgres based on the URL primary key (upsert sketch below).
  - Add new versions of the same document and track versions separately.
  - Delete old versions before inserting new ones (if freshness is critical).
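A sketch of the first option, assuming a hypothetical `pages` table keyed on `url` (table and column names are placeholders, not the actual schema):

```python
import psycopg2

UPSERT_SQL = """
INSERT INTO pages (url, title, content, scraped_at)
VALUES (%(url)s, %(title)s, %(content)s, NOW())
ON CONFLICT (url) DO UPDATE
SET title = EXCLUDED.title,
    content = EXCLUDED.content,
    scraped_at = EXCLUDED.scraped_at;
"""

def upsert_page(conn, item: dict) -> None:
    """Insert a scraped page, or overwrite the existing row with the same URL."""
    with conn.cursor() as cur:
        cur.execute(UPSERT_SQL, item)
    conn.commit()

# usage sketch:
# conn = psycopg2.connect("dbname=rag user=rag host=localhost")
# upsert_page(conn, {"url": "...", "title": "...", "content": "..."})
```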
- RAG pipeline consequences: ➔ If we replace/update data:
  - Need to rechunk the updated documents.
  - Need to recreate embeddings for the changed chunks (re-embedding sketch below).
  - Need to rebuild or update the knowledge graph (KG) accordingly.
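A sketch of re-embedding only the changed documents into Qdrant. Deterministic point IDs derived from (url, chunk index) mean re-upserting a document overwrites its old vectors instead of duplicating them. The collection name and the `embed` callable are assumptions; `embed` stands for whatever embedding function the pipeline actually uses.

```python
import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient("localhost", port=6333)

def reembed_document(url: str, chunks: list[str], embed, collection: str = "thws_chunks"):
    """Recreate embeddings for one changed document and overwrite its old points.
    `embed` maps a list of texts to a list of vectors."""
    vectors = embed(chunks)
    points = [
        PointStruct(
            # Deterministic ID: the same (url, chunk index) always maps to the same
            # point, so the upsert replaces the stale vector.
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, f"{url}#{i}")),
            vector=list(map(float, vec)),
            payload={"url": url, "chunk_index": i, "text": chunk},
        )
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ]
    client.upsert(collection_name=collection, points=points)
    # Note: if the new version has fewer chunks than before, delete the document's
    # leftover points first (e.g. by filtering on the "url" payload field).
```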
- Chunking & KG refresh strategy: ➔ Options:
  - Always rebuild from scratch after a new crawl (simpler, heavier).
  - Implement partial updates if only a small subset changed (complex, more efficient); a change-detection sketch follows.
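Partial updates need a cheap way to decide which documents actually changed. A sketch using content hashes stored alongside each page; the stored-hash source and the rebuild threshold are assumptions.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's cleaned text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def select_changed(old_hashes: dict[str, str], new_docs: dict[str, str]) -> list[str]:
    """Return URLs whose content differs from the previously stored hash."""
    return [
        url for url, text in new_docs.items()
        if old_hashes.get(url) != content_hash(text)
    ]

def plan_refresh(old_hashes, new_docs, full_rebuild_ratio: float = 0.5):
    """Decide between a partial update and a full rebuild of chunks/embeddings/KG."""
    changed = select_changed(old_hashes, new_docs)
    if len(changed) > full_rebuild_ratio * max(len(new_docs), 1):
        return "full_rebuild", changed
    return "partial_update", changed
```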
- Deployment: ➔ Options:
  - Schedule via cron jobs and docker run.
  - Via orchestration: Swarm/K8s and CronJobs.
  - A master container that has access to the Docker socket (scheduler sketch below).
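A sketch of the third option: a small long-running "master" container with the Docker socket mounted (`/var/run/docker.sock`) that starts the scraper image on a schedule via the Docker SDK for Python. Image name, network, schedule time, and the `schedule` dependency are assumptions.

```python
import time
import docker      # pip install docker
import schedule    # pip install schedule

client = docker.from_env()  # talks to the mounted /var/run/docker.sock

def run_scraper():
    """Start the scraper container, wait until it exits, print its logs."""
    logs = client.containers.run(
        "thws-scraper:latest",          # placeholder image name
        command="scrapy crawl thws",
        network="rag_default",          # placeholder: network with Postgres/Qdrant
        remove=True,                    # delete the container after it finishes
    )
    print(logs.decode(errors="replace"))

schedule.every().day.at("03:00").do(run_scraper)   # nightly crawl

while True:
    schedule.run_pending()
    time.sleep(60)
```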
Text Preprocessing & Chunking
- Clean up the text.
- Make chunks that fit the context window of the LLM.
- Overlap chunks by one sentence so the model keeps the surrounding context (sketch below).
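A minimal chunking sketch with a one-sentence overlap, assuming a simple regex sentence splitter and a character budget as a stand-in for the model's token limit:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Very rough sentence splitter; a real pipeline would use spaCy or similar."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_text(text: str, max_chars: int = 2000, overlap_sentences: int = 1) -> list[str]:
    """Greedy chunking: fill each chunk up to max_chars, then start the next
    chunk with the last `overlap_sentences` sentences of the previous one."""
    sentences = split_sentences(text)
    chunks, current, length = [], [], 0
    for sentence in sentences:
        if current and length + len(sentence) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]      # one-sentence overlap
            length = sum(len(s) for s in current)
        current.append(sentence)
        length += len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```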
Vector storage
- Embed the chunks with an embedding model.
- Qdrant as the vector DB (setup sketch below):
  - fast (written in Rust)
  - allows metadata (payloads)
  - has a Python client library
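A sketch of creating the collection and storing embedded chunks with metadata via `qdrant-client`. Collection name, vector size, and the embedding model are assumptions and must match whatever `embed_to_qdrant.py` actually uses.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

client = QdrantClient("localhost", port=6333)
model = SentenceTransformer("all-MiniLM-L6-v2")     # placeholder model, 384-dim

client.recreate_collection(
    collection_name="thws_chunks",                   # placeholder collection name
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

def store_chunks(chunks: list[dict]) -> None:
    """chunks: [{'id': int, 'text': str, 'url': str}, ...]"""
    vectors = model.encode([c["text"] for c in chunks])
    client.upsert(
        collection_name="thws_chunks",
        points=[
            PointStruct(id=c["id"], vector=v.tolist(),
                        payload={"url": c["url"], "text": c["text"]})
            for c, v in zip(chunks, vectors)
        ],
    )
```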
Frontend

- Webapp for manual model comparison: Build a web interface that shows each question and its reference answer alongside anonymized model outputs (labeled A, B, C, …). Evaluators rank the blind responses by quality, producing unbiased, user-driven performance metrics (anonymization sketch below).
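A small sketch of the anonymization step, so labels are shuffled per question but can be resolved back to model names after ranking; the shape of the input/output dicts is an assumption.

```python
import random
import string

def anonymize_outputs(outputs: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """outputs: {model_name: answer}.
    Returns (labeled answers for display, label -> model key for later de-blinding)."""
    models = list(outputs)
    random.shuffle(models)                      # new random order per question
    labels = string.ascii_uppercase[:len(models)]
    display = {label: outputs[model] for label, model in zip(labels, models)}
    key = {label: model for label, model in zip(labels, models)}
    return display, key

# usage sketch:
# display, key = anonymize_outputs({"llama3": "...", "mistral": "..."})
# show `display` to evaluators; keep `key` server-side to score the rankings afterwards.
```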
- Containerized scraper with DB backend: Package the crawler in Docker and connect it to a persistent database (e.g. PostgreSQL or MongoDB). Persist raw content, metadata and “seen URL” state to enable fast, incremental re-scrapes, and ensure identical environments across dev, staging and production. Add the Scrapy addon scrapy-deltafetch: it skips already-seen pages based on saved fingerprints, reducing redundant downloads (pipeline sketch below).
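A sketch of the DB side as a Scrapy item pipeline, reusing the same upsert pattern as above. The DSN setting key and the `pages` schema are placeholders.

```python
import psycopg2

class PostgresPipeline:
    """Scrapy item pipeline (sketch): persist each scraped page into Postgres."""

    def open_spider(self, spider):
        # The DSN would come from settings or the environment in a real setup.
        self.conn = psycopg2.connect(spider.settings.get(
            "POSTGRES_DSN", "dbname=rag user=rag host=localhost"))

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        with self.conn.cursor() as cur:
            cur.execute(
                """INSERT INTO pages (url, title, content, scraped_at)
                   VALUES (%s, %s, %s, NOW())
                   ON CONFLICT (url) DO UPDATE
                   SET title = EXCLUDED.title,
                       content = EXCLUDED.content,
                       scraped_at = EXCLUDED.scraped_at""",
                (item["url"], item.get("title"), item.get("content")),
            )
        self.conn.commit()
        return item

# settings.py (sketch):
# ITEM_PIPELINES = {"thws_scraper.pipelines.PostgresPipeline": 300}
```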
Setup

To set up Python, follow these steps. Then set up the git pre-commit hooks to ensure that we all use the same file formatting.

Open the rag repository folder in a terminal and install the Python dependencies:

pip install -r requirements.txt

Run the scraper:

cd thws_scraper && scrapy crawl thws

This will output the raw data and the chunked data in the thws_scraper folder. You can watch the progress at http://localhost:7000/live for an HTML-based table and at http://localhost:7000/stats for the raw JSON data.

Start the services, embed the chunks into Qdrant, and start Ollama:

docker compose up -d
python3 embed_to_qdrant.py data/thws_data_chunks.json
ollama serve

Then run a query:

python3 query.py
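For orientation, a hedged sketch of what a query step like `query.py` can look like: embed the question, retrieve the closest chunks from Qdrant, and pass them as context to a local model via the `ollama` Python package. Collection name, model names, and prompt format are assumptions, not the actual script.

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer
import ollama

client = QdrantClient("localhost", port=6333)
embedder = SentenceTransformer("all-MiniLM-L6-v2")   # must match the embedding step

def answer(question: str, top_k: int = 5) -> str:
    # Retrieve the most similar chunks from the vector store.
    hits = client.search(
        collection_name="thws_chunks",                # placeholder collection name
        query_vector=embedder.encode(question).tolist(),
        limit=top_k,
    )
    context = "\n\n".join(h.payload["text"] for h in hits)
    # Generate an answer grounded in the retrieved context.
    response = ollama.chat(
        model="llama3",                               # placeholder model
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response["message"]["content"]

if __name__ == "__main__":
    print(answer("How do I register for exams?"))
```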