Commit 251b134

Merge pull request #3 from kagent-dev/denis-github-source: Adding github source

2 parents 5a81dc7 + 16684ee

File tree

3 files changed: +618 −148 lines

README.md (+84 −43)
@@ -1,12 +1,13 @@
 # Doc2Vec

-This project provides a configurable tool (`doc2vec`) to crawl specified websites (typically documentation sites), extract relevant content, convert it to Markdown, chunk it intelligently, generate vector embeddings using OpenAI, and store the chunks along with their embeddings in a vector database (SQLite with `sqlite-vec` or Qdrant).
+This project provides a configurable tool (`doc2vec`) to crawl specified websites (typically documentation sites) and GitHub repositories, extract relevant content, convert it to Markdown, chunk it intelligently, generate vector embeddings using OpenAI, and store the chunks along with their embeddings in a vector database (SQLite with `sqlite-vec` or Qdrant).

 The primary goal is to prepare documentation content for Retrieval-Augmented Generation (RAG) systems or semantic search applications.

 ## Key Features

 * **Website Crawling:** Recursively crawls websites starting from a given base URL.
+* **GitHub Issues Integration:** Retrieves GitHub issues and comments, processing them into searchable chunks.
 * **Content Extraction:** Uses Puppeteer for rendering JavaScript-heavy pages and `@mozilla/readability` to extract the main article content.
 * **HTML to Markdown:** Converts extracted HTML to clean Markdown using `turndown`, preserving code blocks and basic formatting.
 * **Intelligent Chunking:** Splits Markdown content into manageable chunks based on headings and token limits, preserving context.
@@ -15,8 +16,9 @@ The primary goal is to prepare documentation content for Retrieval-Augmented Gen
   * **SQLite:** Using `better-sqlite3` and the `sqlite-vec` extension for efficient vector search.
   * **Qdrant:** A dedicated vector database, using the `@qdrant/js-client-rest`.
 * **Change Detection:** Uses content hashing to detect changes and only re-embeds and updates chunks that have actually been modified.
+* **Incremental Updates:** For GitHub sources, tracks the last run date to only fetch new or updated issues.
 * **Cleanup:** Removes obsolete chunks from the database corresponding to pages that are no longer found during a crawl.
-* **Configuration:** Driven by a YAML configuration file (`config.yaml`) specifying sites, database types, metadata, and other parameters.
+* **Configuration:** Driven by a YAML configuration file (`config.yaml`) specifying sites, repositories, database types, metadata, and other parameters.
 * **Structured Logging:** Uses a custom logger (`logger.ts`) with levels, timestamps, colors, progress bars, and child loggers for clear execution monitoring.

 ## Prerequisites
@@ -25,6 +27,7 @@ The primary goal is to prepare documentation content for Retrieval-Augmented Gen
 * **npm:** Node Package Manager (usually comes with Node.js).
 * **TypeScript:** As the project is written in TypeScript (`ts-node` is used for execution via `npm start`).
 * **OpenAI API Key:** You need an API key from OpenAI to generate embeddings.
+* **GitHub Personal Access Token:** Required for accessing GitHub issues (set as `GITHUB_PERSONAL_ACCESS_TOKEN` in your environment).
 * **(Optional) Qdrant Instance:** If using the `qdrant` database type, you need a running Qdrant instance accessible from where you run the script.
 * **(Optional) Build Tools:** Dependencies like `better-sqlite3` and `sqlite-vec` might require native compilation, which could necessitate build tools like `python`, `make`, and a C++ compiler (like `g++` or Clang) depending on your operating system.

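The "Intelligent Chunking" feature added to the list above splits Markdown at headings and enforces a size budget per chunk. A minimal sketch of that idea (illustrative only, not the actual `doc2vec` implementation; `chunkMarkdown`, `DocumentChunk`'s shape here, and the character budget standing in for a token limit are all assumptions):

```typescript
// Heading-aware chunking sketch: split Markdown at headings, then break any
// oversized section at paragraph boundaries so each chunk stays under a
// character budget (a stand-in for a real token limit).
interface DocumentChunk {
  heading: string;
  content: string;
}

function chunkMarkdown(markdown: string, maxChars: number): DocumentChunk[] {
  const chunks: DocumentChunk[] = [];
  let heading = "";
  let buffer: string[] = [];

  const flush = () => {
    const text = buffer.join("\n").trim();
    buffer = [];
    if (!text) return;
    // Oversized sections are split at blank lines to respect the budget.
    let piece = "";
    for (const para of text.split(/\n{2,}/)) {
      if (piece && piece.length + para.length + 2 > maxChars) {
        chunks.push({ heading, content: piece });
        piece = "";
      }
      piece = piece ? `${piece}\n\n${para}` : para;
    }
    if (piece) chunks.push({ heading, content: piece });
  };

  for (const line of markdown.split("\n")) {
    if (/^#{1,6}\s/.test(line)) {
      flush(); // close out the previous section
      heading = line.replace(/^#{1,6}\s+/, "");
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```

Keeping the nearest heading attached to every chunk is what "preserving context" buys: each stored chunk carries enough surrounding information to be retrieved meaningfully on its own.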
@@ -56,50 +59,81 @@ Configuration is managed through two files:
 # Required: Your OpenAI API Key
 OPENAI_API_KEY="sk-..."
+
+# Required for GitHub sources
+GITHUB_PERSONAL_ACCESS_TOKEN="ghp_..."

 # Optional: Required only if using Qdrant
 QDRANT_API_KEY="your-qdrant-api-key"
 ```

 2. **`config.yaml` file:**
-   This file defines the sites to crawl and how to process them. Create a `config.yaml` file (or use a different name and pass it as an argument).
+   This file defines the sources to process and how to handle them. Create a `config.yaml` file (or use a different name and pass it as an argument).

    **Structure:**

-   * `sites`: An array of site configurations.
+   * `sources`: An array of source configurations.
+     * `type`: Either `'website'` or `'github'`
+
+   For websites (`type: 'website'`):
    * `url`: The starting URL for crawling the documentation site.
-   * `database_type`: Specifies the storage backend (`sqlite` or `qdrant`).
+
+   For GitHub repositories (`type: 'github'`):
+   * `repo`: Repository name in the format `'owner/repo'` (e.g., `'istio/istio'`).
+   * `start_date`: (Optional) Starting date to fetch issues from (e.g., `'2025-01-01'`).
+
+   Common configuration for both types:
    * `product_name`: A string identifying the product (used in metadata).
    * `version`: A string identifying the product version (used in metadata).
-   * `max_size`: Maximum **raw HTML content size** (in characters) fetched by Puppeteer. If a page's initial HTML exceeds this, it will be skipped *before* performing expensive DOM parsing, Readability, and Markdown conversion. Recommending 1MB (1048576).
-   * `database_params`: Parameters specific to the chosen `database_type`.
-     * For `sqlite`:
-       * `db_path`: (Optional) Path to the SQLite database file. Defaults to `./<product_name>-<version>.db`.
-     * For `qdrant`:
-       * `qdrant_url`: (Optional) URL of your Qdrant instance. Defaults to `http://localhost:6333`.
-       * `qdrant_port`: (Optional) Port for the Qdrant REST API. Defaults to `443` if `qdrant_url` starts with `https`, otherwise `6333`.
-       * `collection_name`: (Optional) Name of the Qdrant collection to use. Defaults to `<product_name>_<version>` (lowercased, spaces replaced with underscores).
+   * `max_size`: Maximum raw content size (in characters). For websites, this limits the raw HTML fetched by Puppeteer. Recommending 1MB (1048576).
+   * `database_config`: Configuration for the database.
+     * `type`: Specifies the storage backend (`'sqlite'` or `'qdrant'`).
+     * `params`: Parameters specific to the chosen database type.
+       * For `sqlite`:
+         * `db_path`: (Optional) Path to the SQLite database file. Defaults to `./<product_name>-<version>.db`.
+       * For `qdrant`:
+         * `qdrant_url`: (Optional) URL of your Qdrant instance. Defaults to `http://localhost:6333`.
+         * `qdrant_port`: (Optional) Port for the Qdrant REST API. Defaults to `443` if `qdrant_url` starts with `https`, otherwise `6333`.
+         * `collection_name`: (Optional) Name of the Qdrant collection to use. Defaults to `<product_name>_<version>` (lowercased, spaces replaced with underscores).

 **Example (`config.yaml`):**
 ```yaml
-sites:
-  - url: 'https://argo-cd.readthedocs.io/en/stable/'
-    database_type: 'sqlite'
+sources:
+  # Website source example
+  - type: 'website'
     product_name: 'argo'
     version: 'stable'
+    url: 'https://argo-cd.readthedocs.io/en/stable/'
     max_size: 1048576
-    database_params:
-      db_path: './vector-dbs/argo-cd.db'
-
-  - url: 'https://istio.io/latest/docs/'
-    database_type: 'qdrant'
+    database_config:
+      type: 'sqlite'
+      params:
+        db_path: './vector-dbs/argo-cd.db'
+
+  # GitHub repository source example
+  - type: 'github'
+    product_name: 'istio'
+    version: 'latest'
+    repo: 'istio/istio'
+    start_date: '2025-01-01'
+    max_size: 1048576
+    database_config:
+      type: 'sqlite'
+      params:
+        db_path: './istio-issues.db'
+
+  # Qdrant example
+  - type: 'website'
     product_name: 'Istio'
     version: 'latest'
+    url: 'https://istio.io/latest/docs/'
     max_size: 1048576
-    database_params:
-      qdrant_url: 'https://your-qdrant-instance.cloud'
-      qdrant_port: 6333
-      collection_name: 'istio_docs_latest'
-# ... more sites
+    database_config:
+      type: 'qdrant'
+      params:
+        qdrant_url: 'https://your-qdrant-instance.cloud'
+        qdrant_port: 6333
+        collection_name: 'istio_docs_latest'
+# ... more sources
 ```
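The new `sources` schema above maps naturally onto a discriminated union keyed on `type`. A sketch of how a parsed entry might be typed and validated (illustrative only; `SourceConfig`, `validateSource`, and these exact interface names are assumptions, not the repo's actual types):

```typescript
// Sketch of the config schema: `type` discriminates website vs. github
// sources, while product/version/database settings are shared.
interface DatabaseConfig {
  type: "sqlite" | "qdrant";
  params?: {
    db_path?: string;
    qdrant_url?: string;
    qdrant_port?: number;
    collection_name?: string;
  };
}

interface BaseSource {
  product_name: string;
  version: string;
  max_size: number;
  database_config: DatabaseConfig;
}

interface WebsiteSource extends BaseSource {
  type: "website";
  url: string; // starting URL for the crawl
}

interface GithubSource extends BaseSource {
  type: "github";
  repo: string;        // 'owner/repo'
  start_date?: string; // e.g. '2025-01-01'
}

type SourceConfig = WebsiteSource | GithubSource;

// Basic structural validation for one parsed YAML entry.
function validateSource(raw: any): SourceConfig {
  if (raw?.type === "website" && typeof raw.url === "string") return raw;
  if (raw?.type === "github" && /^[^/]+\/[^/]+$/.test(raw.repo ?? "")) return raw;
  throw new Error(`invalid source entry: ${JSON.stringify(raw)}`);
}
```

A discriminated union lets the rest of the pipeline branch on `source.type` with full type narrowing, which is why moving from a flat `sites` array to typed `sources` is more than a rename.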

 ## Usage
@@ -119,42 +153,49 @@ If no path is provided, the script defaults to looking for `config.yaml` in the
 The script will then:
 1. Load the configuration.
 2. Initialize the structured logger.
-3. Iterate through each site defined in the config.
+3. Iterate through each source defined in the config.
 4. Initialize the specified database connection.
-5. Crawl the site.
-6. For each valid page: extract content, convert to Markdown, chunk, check for changes, generate embeddings (if needed), and store/update in the database.
+5. Process each source according to its type:
+   - For websites: Crawl the site, extract content, convert to Markdown
+   - For GitHub repos: Fetch issues and comments, convert to Markdown
+6. For all sources: Chunk content, check for changes, generate embeddings (if needed), and store/update in the database.
 7. Cleanup obsolete chunks.
 8. Output detailed logs.

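Step 5's GitHub path leans on the incremental-update feature: remember when the tool last ran and only process issues updated since then. A rough sketch of that filter (hypothetical names; the GitHub API can also do this server-side via the `since` query parameter on the list-issues endpoint, so local filtering here is purely for illustration):

```typescript
// Incremental-fetch sketch: keep the previous run's timestamp and skip
// issues whose `updated_at` predates it.
interface Issue {
  number: number;
  title: string;
  updated_at: string; // ISO 8601 timestamp, as returned by the GitHub API
}

function filterSince(issues: Issue[], lastRun: Date | null): Issue[] {
  if (!lastRun) return issues; // first run: process everything
  return issues.filter((i) => new Date(i.updated_at) > lastRun);
}
```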
 ## Database Options

-### SQLite (`database_type: 'sqlite'`)
+### SQLite (`database_config.type: 'sqlite'`)
 * Uses `better-sqlite3` and `sqlite-vec`.
 * Requires `db_path`.
 * Native compilation might be needed.

-### Qdrant (`database_type: 'qdrant'`)
+### Qdrant (`database_config.type: 'qdrant'`)
 * Uses `@qdrant/js-client-rest`.
 * Requires `qdrant_url`, `qdrant_port`, `collection_name` and potentially `QDRANT_API_KEY`.

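Both backends answer the same question: given a query embedding, which stored chunk embeddings are nearest? As a conceptual illustration of that query (this is not how `sqlite-vec` or Qdrant are implemented; they use indexed, far more efficient search, and `topK` is a hypothetical name):

```typescript
// Brute-force nearest-neighbor search by cosine similarity, illustrating
// the query that both SQLite (`sqlite-vec`) and Qdrant answer efficiently.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topK(
  query: number[],
  items: { id: string; embedding: number[] }[],
  k: number
): { id: string; score: number }[] {
  return items
    .map((it) => ({ id: it.id, score: cosineSimilarity(query, it.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```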
 ## Core Logic Flow

 1. **Load Config:** Read and parse `config.yaml`.
 2. **Initialize Logger:** Set up the structured logger.
-3. **Iterate Sites:** For each site in the config:
+3. **Iterate Sources:** For each source in the config:
    1. **Initialize Database:** Connect to SQLite or Qdrant, create necessary tables/collections.
-   2. **Crawl:**
-      * Start at the base `url`.
-      * Use Puppeteer (`processPage`) to fetch and render HTML.
-      * Use Readability to extract main content.
-      * Sanitize HTML.
-      * Convert HTML to Markdown using Turndown.
-      * Use `axios`/`cheerio` on the *original* fetched page (before Puppeteer) to find new links to add to the crawl queue.
-      * Keep track of all visited URLs.
-   3. **Process Content:** For each crawled page's Markdown:
+   2. **Process by Source Type:**
+      - **For Websites:**
+        * Start at the base `url`.
+        * Use Puppeteer (`processPage`) to fetch and render HTML.
+        * Use Readability to extract main content.
+        * Sanitize HTML.
+        * Convert HTML to Markdown using Turndown.
+        * Use `axios`/`cheerio` on the *original* fetched page (before Puppeteer) to find new links to add to the crawl queue.
+        * Keep track of all visited URLs.
+      - **For GitHub Repositories:**
+        * Fetch issues and comments using the GitHub API.
+        * Convert to formatted Markdown.
+        * Track last run date to support incremental updates.
+   3. **Process Content:** For each processed page or issue:
       * **Chunk:** Split Markdown into smaller `DocumentChunk` objects based on headings and size.
       * **Hash Check:** Generate a hash of the chunk content. Check if a chunk with the same ID exists in the DB and if its hash matches.
       * **Embed (if needed):** If the chunk is new or changed, call the OpenAI API (`createEmbeddings`) to get the vector embedding.
       * **Store:** Insert or update the chunk, metadata, hash, and embedding in the database (SQLite `vec_items` table or Qdrant collection).
-   4. **Cleanup:** After crawling the site, query the database for all chunks associated with that site's URL prefix. Delete any chunks whose URLs were *not* in the set of visited URLs for the current run.
+   4. **Cleanup:** After processing, remove any obsolete chunks from the database.
 4. **Complete:** Log completion status.
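The hash check in step 3 of the flow above can be sketched with Node's crypto module (a minimal illustration; `needsReembedding` is a hypothetical helper, and the real tool stores hashes alongside chunks in the database rather than in an in-memory map):

```typescript
import { createHash } from "node:crypto";

// Change-detection sketch: hash each chunk's content and compare against
// the hash stored from the previous run; only changed or brand-new chunks
// get sent to the (comparatively expensive) embedding API.
function contentHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

function needsReembedding(
  content: string,
  storedHashes: Map<string, string>, // chunkId -> hash from the last run
  chunkId: string
): boolean {
  return storedHashes.get(chunkId) !== contentHash(content);
}
```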