A complete pipeline for downloading, processing, and creating vector embeddings from documentation. This project lets you build searchable semantic embeddings from web content and visualize them in 2D/3D space.
Note: A curated copy of the Godot website is included in the repository (`artifacts/curated/godotengine`), since it takes a long time to download and process. If you want to work with the Godot documentation, you can start directly at the text chunking step.
I recently learnt that the Godot website is actually generated from the godot-docs repository, so there is no need to crawl the website at all; downloading the repo is enough, and it is much easier to keep up to date that way. This means the curator needs to be updated to support it.
## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/zivshek/web2embeddings.git
   cd web2embeddings
   ```
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## Quick Start

The repository includes pre-curated Godot documentation, so you can quickly try out the pipeline:
1. Generate text chunks from the curated Godot documentation:

   ```bash
   python chunker.py --input artifacts/curated/godotengine --chunk-size 400 --chunk-overlap 20
   ```
2. Create vector embeddings:

   ```bash
   python vectorizer.py --input artifacts/chunks/godotengine_chunks_SZ_400_O_20.jsonl
   ```

   The collection name is logged to `artifacts/vector_stores/collections.txt`, so you can copy it for later use.
3. Visualize the embeddings:

   ```bash
   python visualizer.py --collection godotengine_chunks_SZ_400_O_20_sentence-transformers_all-MiniLM-L6-v2
   ```
To process new web content from scratch:
1. Add URLs to `websites_to_download.txt`, then download the content:

   ```bash
   python downloader.py
   ```
2. Follow steps 2-5 in the Workflow section below.
## Overview

This project provides a comprehensive workflow for:
- Downloading web content (such as Godot documentation)
- Curating the content by cleaning and converting to markdown
- Chunking the text into manageable segments
- Creating vector embeddings of the text chunks
- Visualizing the embeddings in 2D/3D space
## Prerequisites

- Python 3.8+
- Required Python packages (installed via requirements.txt)
- Sufficient disk space for downloaded content and vector database
## Project Structure

```
web2embeddings/
├── downloader.py              # Downloads web content
├── page_curator.py            # Cleans HTML and converts to markdown
├── chunker.py                 # Splits text into chunks
├── vectorizer.py              # Creates vector embeddings
├── visualizer.py              # Visualizes embeddings in 2D/3D
├── websites_to_download.txt   # List of websites to download
└── artifacts/                 # Directory for all generated files
    ├── downloaded_sites/      # Raw downloaded HTML
    ├── curated/               # Cleaned markdown files
    ├── chunks/                # Text chunks in JSONL format
    ├── vector_stores/         # ChromaDB vector database
    └── visualizations/        # 2D/3D visualizations
```
## Workflow

### 1. Download web content

Download HTML content from the websites listed in `websites_to_download.txt`:

```bash
python downloader.py --delay 1.0
```

Options:

- `--delay` / `-d`: Delay between requests in seconds (default: 1.0)
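For context, the polite crawl that `downloader.py` performs boils down to fetching each listed URL with a pause between requests. A minimal sketch of that idea, assuming `requests` and a one-file-per-domain output scheme (both illustrative, not the script's actual layout):

```python
import time
from pathlib import Path
from urllib.parse import urlparse

import requests

def download_sites(url_list_file: str, out_dir: str, delay: float = 1.0) -> None:
    """Fetch each URL in the list, saving the raw HTML and sleeping between requests."""
    urls = [u.strip() for u in Path(url_list_file).read_text().splitlines() if u.strip()]
    for url in urls:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        # Name the output after the site's domain (illustrative naming scheme).
        target = Path(out_dir) / urlparse(url).netloc / "index.html"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(response.text, encoding="utf-8")
        time.sleep(delay)  # throttle requests to avoid rate limiting
```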
### 2. Curate content

Clean the downloaded HTML and convert it to markdown:

```bash
python page_curator.py --input artifacts/downloaded_sites/site_domain
```

Options:

- `--input` / `-i`: Input directory with downloaded HTML
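Conceptually this is a strip-and-convert pass. A minimal sketch, assuming BeautifulSoup for cleanup and html2text for the conversion (the actual script may use different libraries and drop a different set of tags):

```python
from bs4 import BeautifulSoup
import html2text

def curate_page(html: str) -> str:
    """Strip non-content elements from raw HTML, then convert the rest to markdown."""
    soup = BeautifulSoup(html, "html.parser")
    # Elements that typically carry no documentation content (illustrative list).
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    converter = html2text.HTML2Text()
    converter.ignore_images = True  # assumption: images are dropped during curation
    return converter.handle(str(soup))
```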
### 3. Chunk text

Split the markdown files into manageable chunks:

```bash
python chunker.py --input artifacts/curated/site_domain --chunk-size 400 --chunk-overlap 20
```

Options:

- `--input` / `-i`: Input directory with markdown files
- `--chunk-size` / `-s`: Maximum size of chunks in characters (default: 400)
- `--chunk-overlap` / `-v`: Overlap between chunks in characters (default: 20)
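Character-based chunking with overlap is simple at its core. A minimal sketch of the idea (the real `chunker.py` may additionally respect sentence or section boundaries):

```python
def chunk_text(text: str, chunk_size: int = 400, chunk_overlap: int = 20) -> list[str]:
    """Split text into chunks of at most chunk_size characters, where each chunk
    shares its first chunk_overlap characters with the end of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]
```

The overlap means a query can still match text that happens to straddle a chunk boundary.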
### 4. Create embeddings

Generate embeddings and store them in ChromaDB:

```bash
python vectorizer.py --input artifacts/chunks/chunks_SZ_400_O_20.jsonl --db artifacts/vector_stores/chroma_db
```

Options:

- `--input` / `-i`: Input JSONL file containing text chunks
- `--db` / `-d`: Directory for the ChromaDB vector database (default: artifacts/vector_stores/chroma_db)
- `--model` / `-m`: Name of the sentence-transformer model (default: sentence-transformers/all-MiniLM-L6-v2)
- `--batch-size` / `-b`: Batch size for embedding generation (default: 32)
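In essence, this step embeds each chunk with a sentence-transformer and adds it to a ChromaDB collection in batches. A hedged sketch, assuming each JSONL record has `id` and `text` fields and using a hypothetical collection name (the real schema and naming are defined by the script):

```python
import json

import chromadb
from sentence_transformers import SentenceTransformer

def vectorize(jsonl_path: str, db_dir: str, model_name: str, batch_size: int = 32) -> None:
    """Embed every chunk in the JSONL file and store it in a ChromaDB collection."""
    model = SentenceTransformer(model_name)
    client = chromadb.PersistentClient(path=db_dir)
    collection = client.get_or_create_collection("example_collection")  # hypothetical name

    with open(jsonl_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]

    for start in range(0, len(records), batch_size):
        batch = records[start : start + batch_size]
        texts = [r["text"] for r in batch]   # assumed field name
        ids = [str(r["id"]) for r in batch]  # assumed field name
        embeddings = model.encode(texts).tolist()
        collection.add(ids=ids, documents=texts, embeddings=embeddings)
```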
Note: The collection name is logged to `artifacts/vector_stores/collections.txt` for later use.
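One such later use is querying the collection for semantic search. A sketch assuming the default paths, model, and the collection name produced by the commands above (the example query string is arbitrary):

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Paths, model, and collection name follow the defaults documented above.
client = chromadb.PersistentClient(path="artifacts/vector_stores/chroma_db")
collection = client.get_collection("chunks_SZ_400_O_20_sentence-transformers_all-MiniLM-L6-v2")
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Embed a natural-language question and fetch the five closest chunks.
query = model.encode("How do I add a node to the scene tree?").tolist()
results = collection.query(query_embeddings=[query], n_results=5)
for doc in results["documents"][0]:
    print(doc[:120], "...")
```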
### 5. Visualize embeddings

Create interactive 2D/3D visualizations of the embeddings:

```bash
python visualizer.py --collection chunks_SZ_400_O_20_sentence-transformers_all-MiniLM-L6-v2
```

Options:

- `--db` / `-d`: ChromaDB database directory (default: artifacts/vector_stores/chroma_db)
- `--collection` / `-c`: Name of the collection in ChromaDB
- `--max-points` / `-m`: Maximum number of points to visualize (default: 2000)
- `--seed` / `-s`: Random seed for reproducibility (default: 42)
- `--clusters` / `-k`: Number of clusters for coloring (default: 10)
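The usual recipe behind this kind of plot is dimensionality reduction plus clustering for color. A sketch of one way to do it, assuming scikit-learn and plotly (the reducer the script actually uses is not specified here):

```python
import numpy as np
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def plot_embeddings(embeddings: np.ndarray, n_clusters: int = 10, seed: int = 42) -> None:
    """Project high-dimensional embeddings to 2D and color each point by its k-means cluster."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)
    coords = TSNE(n_components=2, random_state=seed).fit_transform(embeddings)
    fig = px.scatter(x=coords[:, 0], y=coords[:, 1], color=labels.astype(str))
    fig.show()
```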
## Applications

This project was designed to process Godot documentation, but can be adapted for any web content. Potential applications include:
- Technical documentation exploration
- Semantic search engines
- Content organization and discovery
- Document similarity analysis
## Notes

- The `.gitignore` is set up to exclude the artifacts directory, to avoid committing large files.
- For large websites, consider increasing the `--delay` passed to `downloader.py` to avoid rate limiting.
- Vector embeddings require significant memory for large collections.