# Doc2Vec
This project provides a configurable tool (`doc2vec`) that crawls specified websites (typically documentation sites) and processes GitHub repositories and local directories, extracts the relevant content, converts it to Markdown, chunks it intelligently, generates vector embeddings using OpenAI, and stores the chunks along with their embeddings in a vector database (SQLite with `sqlite-vec` or Qdrant).

The primary goal is to prepare documentation content for Retrieval-Augmented Generation (RAG) systems or semantic search applications.
## Key Features
* **Website Crawling:** Recursively crawls websites starting from a given base URL.
* **GitHub Issues Integration:** Retrieves GitHub issues and comments, processing them into searchable chunks.
* **Local Directory Processing:** Scans local directories for files and converts their content into searchable chunks.
* **Content Extraction:** Uses Puppeteer for rendering JavaScript-heavy pages and `@mozilla/readability` to extract the main article content.
* **HTML to Markdown:** Converts extracted HTML to clean Markdown using `turndown`, preserving code blocks and basic formatting.
* **Intelligent Chunking:** Splits Markdown content into manageable chunks based on headings and token limits, preserving context (see the sketch after this list).
* **Qdrant:** A dedicated vector database, accessed via the `@qdrant/js-client-rest` client.
* **Change Detection:** Uses content hashing to detect changes and only re-embeds and updates chunks that have actually been modified.
* **Incremental Updates:** For GitHub sources, tracks the last run date to only fetch new or updated issues.
* **Cleanup:** Removes obsolete chunks from the database corresponding to pages or files that are no longer found during processing.
* **Configuration:** Driven by a YAML configuration file (`config.yaml`) specifying sites, repositories, local directories, database types, metadata, and other parameters.
* **Structured Logging:** Uses a custom logger (`logger.ts`) with levels, timestamps, colors, progress bars, and child loggers for clear execution monitoring.
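To make the chunking behaviour concrete, here is a minimal sketch of heading-based splitting with a size cap. It is not the project's actual implementation: the `DocumentChunk` shape, the `MAX_CHUNK_CHARS` constant (a character-count stand-in for the real token limit), and the ID scheme are illustrative assumptions.

```typescript
// Hypothetical sketch of heading-based chunking with a size cap; not the tool's real code.
interface DocumentChunk {
  id: string;       // illustrative ID: source identifier plus a running index
  heading: string;  // nearest heading, kept so each chunk retains context
  content: string;
}

const MAX_CHUNK_CHARS = 2000; // stand-in for a token limit

function chunkMarkdown(markdown: string, sourceId: string): DocumentChunk[] {
  const chunks: DocumentChunk[] = [];
  let heading = "";
  let buffer: string[] = [];

  const flush = () => {
    const section = buffer.join("\n").trim();
    buffer = [];
    if (!section) return;
    // Split oversized sections so every chunk stays under the cap.
    for (let i = 0; i < section.length; i += MAX_CHUNK_CHARS) {
      const part = section.slice(i, i + MAX_CHUNK_CHARS);
      chunks.push({
        id: `${sourceId}#${chunks.length}`,
        heading,
        content: heading ? `${heading}\n\n${part}` : part,
      });
    }
  };

  for (const line of markdown.split("\n")) {
    if (/^#{1,6}\s/.test(line)) {
      flush();                // close the previous section
      heading = line.trim();  // remember the new heading for context
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```

In the real tool each chunk is additionally hashed for change detection, as described under **Change Detection** above.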
## Prerequisites
Configuration is managed through two files:

**Structure:**
* `sources`: An array of source configurations.
* `type`: Either `'website'`, `'github'`, or `'local_directory'`

For websites (`type: 'website'`):

* `url`: The starting URL for crawling the documentation site.

For GitHub repositories (`type: 'github'`):

* `repo`: Repository name in the format `'owner/repo'` (e.g., `'istio/istio'`).
* `start_date`: (Optional) Starting date to fetch issues from (e.g., `'2025-01-01'`).

For local directories (`type: 'local_directory'`):

* `path`: Path to the local directory to process.
* `include_extensions`: (Optional) Array of file extensions to include (e.g., `['.md', '.txt']`). Defaults to `['.md', '.txt', '.html', '.htm']`.
* `exclude_extensions`: (Optional) Array of file extensions to exclude.
* `recursive`: (Optional) Whether to traverse subdirectories (defaults to `true`).
* `encoding`: (Optional) File encoding to use (defaults to `'utf8'`).

Common configuration for all types:

* `product_name`: A string identifying the product (used in metadata).
* `version`: A string identifying the product version (used in metadata).
* `max_size`: Maximum raw content size (in characters). For websites, this limits the raw HTML fetched by Puppeteer. 1 MB (1048576) is recommended.
```yaml
      params:
        db_path: './istio-issues.db'

  # Local directory source example
  - type: 'local_directory'
    product_name: 'project-docs'
    version: 'current'
    path: './docs'
    include_extensions: ['.md', '.txt']
    recursive: true
    max_size: 1048576
    database_config:
      type: 'sqlite'
      params:
        db_path: './project-docs.db'

  # Qdrant example
  - type: 'website'
    product_name: 'Istio'
```
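When a source is configured to use Qdrant (the `# Qdrant example` above is truncated before its database settings), chunks and their embeddings are written to a Qdrant collection through `@qdrant/js-client-rest`. The following is a minimal, hypothetical sketch of such a write; the collection name, vector size, distance metric, and local URL are assumptions rather than values taken from this project.

```typescript
// Hypothetical sketch of storing one chunk in Qdrant; not this project's actual code.
import { QdrantClient } from "@qdrant/js-client-rest";

const client = new QdrantClient({ url: "http://localhost:6333" }); // assumed local instance

async function ensureCollection(name: string): Promise<void> {
  const { collections } = await client.getCollections();
  if (!collections.some((c) => c.name === name)) {
    // 1536 matches the output size of OpenAI's small embedding models (an assumption here).
    await client.createCollection(name, { vectors: { size: 1536, distance: "Cosine" } });
  }
}

async function storeChunk(id: number, vector: number[], payload: Record<string, unknown>): Promise<void> {
  // Qdrant point IDs must be unsigned integers or UUID strings.
  await client.upsert("doc2vec_chunks", { points: [{ id, vector, payload }] });
}
```

Here the `payload` is where per-chunk metadata such as `product_name` and `version` would travel.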
The script will then:

5. Process each source according to its type:
   - For websites: Crawl the site, extract content, convert to Markdown
   - For GitHub repos: Fetch issues and comments, convert to Markdown
   - For local directories: Scan files and process their content, converting HTML to Markdown if needed (see the sketch after this list)
6. For all sources: Chunk content, check for changes, generate embeddings (if needed), and store/update them in the database.
7. Clean up obsolete chunks.
8. Output detailed logs.
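For the local-directory case in step 5, the scan is essentially a recursive directory walk filtered by file extension. The sketch below is a simplified illustration rather than the tool's actual code; the function name is hypothetical, and the defaults mirror the `include_extensions` and `recursive` options described in the configuration section. Matching HTML files would then be converted to Markdown (e.g., via `turndown`) before chunking.

```typescript
// Hypothetical sketch of the local-directory scan; not the tool's real implementation.
import { readdir } from "node:fs/promises";
import { extname, join } from "node:path";

async function collectFiles(
  dir: string,
  includeExtensions: string[] = [".md", ".txt", ".html", ".htm"],
  recursive = true,
): Promise<string[]> {
  const matches: string[] = [];
  for (const entry of await readdir(dir, { withFileTypes: true })) {
    const fullPath = join(dir, entry.name);
    if (entry.isDirectory()) {
      // Descend into subdirectories only when the recursive option is enabled.
      if (recursive) matches.push(...(await collectFiles(fullPath, includeExtensions, recursive)));
    } else if (includeExtensions.includes(extname(entry.name).toLowerCase())) {
      matches.push(fullPath);
    }
  }
  return matches;
}

// Example: collectFiles("./docs", [".md", ".txt"]).then((files) => console.log(files));
```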
   - **For GitHub Repositories:**
     * Fetch issues and comments using the GitHub API.
     * Convert to formatted Markdown.
     * Track last run date to support incremental updates.
   - **For Local Directories:**
     * Recursively scan directories for files matching the configured extensions.
     * Read file content, converting HTML to Markdown if needed.
     * Process each file's content.
3. **Process Content:** For each processed page, issue, or file (see the sketch below):
   * **Chunk:** Split Markdown into smaller `DocumentChunk` objects based on headings and size.
   * **Hash Check:** Generate a hash of the chunk content. Check if a chunk with the same ID exists in the DB and if its hash matches.
   * **Embed (if needed):** If the chunk is new or changed, call the OpenAI API (`createEmbeddings`) to get the vector embedding.
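A rough sketch of the hash-check-then-embed flow is shown below. It is illustrative only: the storage map is an in-memory stand-in for the SQLite/Qdrant layer, the embedding model name is an assumption, and the real tool's `createEmbeddings` helper and chunk metadata are not reproduced here.

```typescript
// Hypothetical sketch of change detection + embedding; not the project's actual code.
import { createHash } from "node:crypto";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// In-memory stand-in for the vector database (SQLite with sqlite-vec, or Qdrant).
const store = new Map<string, { hash: string; embedding: number[] }>();

async function processChunk(chunkId: string, content: string): Promise<void> {
  const hash = createHash("sha256").update(content).digest("hex");

  // Unchanged chunk: same ID, same content hash -> skip re-embedding.
  if (store.get(chunkId)?.hash === hash) return;

  // New or changed chunk: embed it, then store/update the record.
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small", // assumed model; the tool's choice may differ
    input: content,
  });
  store.set(chunkId, { hash, embedding: response.data[0].embedding });
}
```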
0 commit comments