A complete pipeline for downloading, processing, and creating vector embeddings from documentation. This project lets you build searchable semantic embeddings from web content and visualize them in 2D/3D space.
Note: A curated copy of the Godot website is included in the repository (`artifacts/curated/godotengine`), since it takes a long time to download and process. If you want to work with the Godot documentation, you can start directly at the text chunking step.
I recently learnt that the Godot website is actually generated from the godot-docs repository, so there is no need to crawl the website at all; downloading the repo is enough, and it is much easier to keep up to date that way. This means the curator needs to be updated to support it.
## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/zivshek/web2embeddings.git
   cd web2embeddings
   ```
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## Quick Start

The repository includes pre-curated Godot documentation, so you can quickly try out the pipeline:
1. Generate text chunks from the curated Godot documentation:

   ```bash
   python chunker.py --input artifacts/curated/godotengine --chunk-size 400 --chunk-overlap 20
   ```
2. Create vector embeddings:

   ```bash
   python vectorizer.py --input artifacts/chunks/godotengine_chunks_SZ_400_O_20.jsonl
   ```

   The collection name is logged to `artifacts/vector_stores/collections.txt`, so you can copy it for later use.
3. Visualize the embeddings:

   ```bash
   python visualizer.py --collection godotengine_chunks_SZ_400_O_20_sentence-transformers_all-MiniLM-L6-v2
   ```
To process new web content from scratch:
1. Add URLs to `websites_to_download.txt`, then download the content:

   ```bash
   python downloader.py
   ```
2. Follow steps 2-5 in the Workflow section below.
## Overview

This project provides a comprehensive workflow for:
- Downloading web content (such as Godot documentation)
- Curating the content by cleaning and converting to markdown
- Chunking the text into manageable segments
- Creating vector embeddings of the text chunks
- Visualizing the embeddings in 2D/3D space
## Prerequisites

- Python 3.8+
- Required Python packages (installed via requirements.txt)
- Sufficient disk space for downloaded content and vector database
## Project Structure

```
web2embeddings/
├── downloader.py              # Downloads web content
├── page_curator.py            # Cleans HTML and converts to markdown
├── chunker.py                 # Splits text into chunks
├── vectorizer.py              # Creates vector embeddings
├── visualizer.py              # Visualizes embeddings in 2D/3D
├── websites_to_download.txt   # List of websites to download
└── artifacts/                 # Directory for all generated files
    ├── downloaded_sites/      # Raw downloaded HTML
    ├── curated/               # Cleaned markdown files
    ├── chunks/                # Text chunks in JSONL format
    ├── vector_stores/         # ChromaDB vector database
    └── visualizations/        # 2D/3D visualizations
```
## Workflow

### 1. Download web content

Download HTML content from the websites listed in `websites_to_download.txt`:

```bash
python downloader.py --delay 1.0
```

Options:

- `--delay` / `-d`: Delay between requests in seconds (default: 1.0)
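For context, the polite crawl that `downloader.py` performs boils down to fetching each listed URL with a pause between requests. A minimal sketch of that idea, assuming `requests` and a one-file-per-domain output scheme (both illustrative, not the script's actual layout):

```python
import time
from pathlib import Path
from urllib.parse import urlparse

import requests

def download_sites(url_list_file: str, out_dir: str, delay: float = 1.0) -> None:
    """Fetch each URL in the list, saving the raw HTML and sleeping between requests."""
    urls = [u.strip() for u in Path(url_list_file).read_text().splitlines() if u.strip()]
    for url in urls:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        # Name the output after the site's domain (illustrative naming scheme).
        target = Path(out_dir) / urlparse(url).netloc / "index.html"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(response.text, encoding="utf-8")
        time.sleep(delay)  # throttle requests to avoid rate limiting
```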
### 2. Curate content

Clean the downloaded HTML and convert it to markdown:

```bash
python page_curator.py --input artifacts/downloaded_sites/site_domain
```

Options:

- `--input` / `-i`: Input directory with downloaded HTML
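Conceptually this is a strip-and-convert pass. A minimal sketch, assuming BeautifulSoup for cleanup and html2text for the conversion (the actual script may use different libraries and drop a different set of tags):

```python
from bs4 import BeautifulSoup
import html2text

def curate_page(html: str) -> str:
    """Strip non-content elements from raw HTML, then convert the rest to markdown."""
    soup = BeautifulSoup(html, "html.parser")
    # Elements that typically carry no documentation content (illustrative list).
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    converter = html2text.HTML2Text()
    converter.ignore_images = True  # assumption: images are dropped during curation
    return converter.handle(str(soup))
```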
### 3. Chunk text

Split the markdown files into manageable chunks:

```bash
python chunker.py --input artifacts/curated/site_domain --chunk-size 400 --chunk-overlap 20
```

Options:

- `--input` / `-i`: Input directory with markdown files
- `--chunk-size` / `-s`: Maximum size of chunks in characters (default: 400)
- `--chunk-overlap` / `-v`: Overlap between chunks in characters (default: 20)
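Character-based chunking with overlap is simple at its core. A minimal sketch of the idea (the real `chunker.py` may additionally respect sentence or section boundaries):

```python
def chunk_text(text: str, chunk_size: int = 400, chunk_overlap: int = 20) -> list[str]:
    """Split text into chunks of at most chunk_size characters, where each chunk
    shares its first chunk_overlap characters with the end of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]
```

The overlap means a query can still match text that happens to straddle a chunk boundary.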
### 4. Create embeddings

Generate embeddings and store them in ChromaDB:

```bash
python vectorizer.py --input artifacts/chunks/chunks_SZ_400_O_20.jsonl --db artifacts/vector_stores/chroma_db
```

Options:

- `--input` / `-i`: Input JSONL file containing text chunks
- `--db` / `-d`: Directory for the ChromaDB vector database (default: artifacts/vector_stores/chroma_db)
- `--model` / `-m`: Name of the sentence-transformer model (default: sentence-transformers/all-MiniLM-L6-v2)
- `--batch-size` / `-b`: Batch size for embedding generation (default: 32)
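In essence, this step embeds each chunk with a sentence-transformer and adds it to a ChromaDB collection in batches. A hedged sketch, assuming each JSONL record has `id` and `text` fields and using a hypothetical collection name (the real schema and naming are defined by the script):

```python
import json

import chromadb
from sentence_transformers import SentenceTransformer

def vectorize(jsonl_path: str, db_dir: str, model_name: str, batch_size: int = 32) -> None:
    """Embed every chunk in the JSONL file and store it in a ChromaDB collection."""
    model = SentenceTransformer(model_name)
    client = chromadb.PersistentClient(path=db_dir)
    collection = client.get_or_create_collection("example_collection")  # hypothetical name

    with open(jsonl_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]

    for start in range(0, len(records), batch_size):
        batch = records[start : start + batch_size]
        texts = [r["text"] for r in batch]   # assumed field name
        ids = [str(r["id"]) for r in batch]  # assumed field name
        embeddings = model.encode(texts).tolist()
        collection.add(ids=ids, documents=texts, embeddings=embeddings)
```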
Note: The collection name is logged to `artifacts/vector_stores/collections.txt` for later use.
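One such later use is querying the collection for semantic search. A sketch assuming the default paths, model, and the collection name produced by the commands above (the example query string is arbitrary):

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Paths, model, and collection name follow the defaults documented above.
client = chromadb.PersistentClient(path="artifacts/vector_stores/chroma_db")
collection = client.get_collection("chunks_SZ_400_O_20_sentence-transformers_all-MiniLM-L6-v2")
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Embed a natural-language question and fetch the five closest chunks.
query = model.encode("How do I add a node to the scene tree?").tolist()
results = collection.query(query_embeddings=[query], n_results=5)
for doc in results["documents"][0]:
    print(doc[:120], "...")
```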
### 5. Visualize embeddings

Create interactive 2D/3D visualizations of the embeddings:

```bash
python visualizer.py --collection chunks_SZ_400_O_20_sentence-transformers_all-MiniLM-L6-v2
```

Options:

- `--db` / `-d`: ChromaDB database directory (default: artifacts/vector_stores/chroma_db)
- `--collection` / `-c`: Name of the collection in ChromaDB
- `--max-points` / `-m`: Maximum number of points to visualize (default: 2000)
- `--seed` / `-s`: Random seed for reproducibility (default: 42)
- `--clusters` / `-k`: Number of clusters for coloring (default: 10)
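The usual recipe behind this kind of plot is dimensionality reduction plus clustering for color. A sketch of one way to do it, assuming scikit-learn and plotly (the reducer the script actually uses is not specified here):

```python
import numpy as np
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def plot_embeddings(embeddings: np.ndarray, n_clusters: int = 10, seed: int = 42) -> None:
    """Project high-dimensional embeddings to 2D and color each point by its k-means cluster."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)
    coords = TSNE(n_components=2, random_state=seed).fit_transform(embeddings)
    fig = px.scatter(x=coords[:, 0], y=coords[:, 1], color=labels.astype(str))
    fig.show()
```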
## Applications

This project was designed to process Godot documentation, but can be adapted for any web content. Potential applications include:
- Technical documentation exploration
- Semantic search engines
- Content organization and discovery
- Document similarity analysis
## Notes

- The `.gitignore` is set up to exclude the artifacts directory, to avoid committing large files.
- For large websites, consider increasing the `--delay` passed to `downloader.py` to avoid rate limiting.
- Vector embeddings require significant memory for large collections.