# docs: comprehensive explanation of document vectorization #644
base: main
## Conversation
I did half a review, but I'm posting my comments here so you can see them in advance.
> - Maintains synchronization between documents and their embeddings
> - Provides resilient error handling and monitoring
>
> ## Setting Up Document Storage
Nit: Maybe it is a personal preference, but I really hate "title case". It feels so unnatural to me 🤷
> PGAI's document vectorization system addresses these challenges through a declarative approach that handles loading, parsing, chunking, and embedding documents with minimal configuration. This architecture:
>
> - Keeps document metadata in PostgreSQL while document content lives in optimized storage systems
What does "document metadata" mean here?
I think it's explained quite well below. I mean things like `created_at`, `updated_at`, the owner of the doc, or anything else you want to store about the document that isn't directly in the file itself. We could also call it document application data or something like that?
> - Keeps document metadata in PostgreSQL while document content lives in optimized storage systems
> - Automatically processes documents into LLM-friendly formats
> - Maintains synchronization between documents and their embeddings
> - Provides resilient error handling and monitoring
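As a rough illustration of the declarative approach the excerpt describes, a vectorizer over a `documents` table might look like the following sketch. The function and parameter names here are assumptions drawn from pgai's general API shape and should be checked against the API reference:

```sql
-- Hypothetical sketch of a declarative document vectorizer.
-- Function and parameter names are assumptions; see the pgai API reference.
SELECT ai.create_vectorizer(
    'documents'::regclass,
    loading   => ai.loading_uri(column_name => 'uri'),   -- fetch content from S3/HTTP URIs
    parsing   => ai.parsing_auto(),                      -- detect PDF, DOCX, etc. automatically
    chunking  => ai.chunking_recursive_character_text_splitter(),
    embedding => ai.embedding_openai('text-embedding-3-small', 1536)
);
```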
Not sure about the "monitoring" part tbh 🤔 What are the solutions we offer in terms of monitoring here besides logs?
Well we do have the vectorizer errors table that gives a good overview I think. But I also agree, this should/could be shorter.
> 4. Splits text into chunks at common markdown breaking points (headers, paragraphs, etc.)
> 5. Generates embeddings using OpenAI's `text-embedding-3-small` model
>
> ### Vectorizer Components
Would it make sense for this to be located above the "basic vectorizer configuration", so that the latter becomes more like the "config reference"?
I don't think so, the idea is that it shows a full config first and then explains the parts of it rather than the other way around. I think this is a good pattern since ideally most of the API is self-explanatory. The user only has to read on further if they want to understand details or different options then.
I agree with @Askir
3e30a66 to 07462ef (Compare): five commits of "Update docs/vectorizer/document-vectorization.md" (Co-authored-by: Sergio Moya <[email protected]>, Signed-off-by: Jascha Beste <[email protected]>).

07462ef to 7f4bbb2 (Compare).
> - HTTP/HTTPS URLs (e.g. `https://example.com/file.pdf`)
> - Local files on the worker machine (e.g. `/path/to/file.pdf`)
>
> In theory this also supports other source like GCS and other blob storage that's supported by the `smart_open` library. However, this is not supported on Timescale Cloud and if you want to use it yourself in a self-hosted installation, you need to make sure that necessary dependencies are installed. Check the [smart open documentation](https://pypi.org/project/smart-open/) for details.
Some rewording could help make this sentence clearer (at least to me, very subjective).
Suggested change:

> Internally, we use the [smart_open](https://pypi.org/project/smart-open/) library to connect to your configured buckets. That means this feature can connect to Google Cloud Storage and any other blob store supported by `smart_open`; however, this capability isn't available on Timescale Cloud, and we do not officially support it. If you want to enable it in a self-hosted installation, you need to install the appropriate `smart_open` dependencies. See the [smart-open documentation](https://pypi.org/project/smart-open/) for details.
Suggested text:

> We use the `smart_open` library to connect to the URI. That means any URI that works with `smart_open` should work (including Google Cloud, Azure, etc.); however, only AWS S3 is supported on Timescale Cloud. In a self-hosted installation, other providers should work, but you need to install the appropriate `smart_open` dependencies and test it yourself. See the [smart-open documentation](https://pypi.org/project/smart-open/) for details.
> This loads documents directly from binary data stored in a PostgreSQL BYTEA column.
>
> #### Parsing Options
All of these parsing options are already documented in the API reference. I would just link to them instead of rewriting the docs.
> #### Chunking Strategies
>
> Chunking divides documents into smaller pieces for embedding. The recommended approach for documents is recursive character splitting:
Suggested change:

> Chunking divides documents into smaller pieces for embedding. As of today, the recommended chunking approach for documents is recursive character splitting:
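For context, a recursive character splitting configuration might look like the sketch below. The function and parameter names are assumptions based on pgai's general API shape; the authoritative forms are in the API reference:

```sql
-- Hypothetical sketch of a recursive-character chunking configuration.
-- Parameter names are assumptions; see the pgai API reference.
SELECT ai.create_vectorizer(
    'documents'::regclass,
    chunking => ai.chunking_recursive_character_text_splitter(
        chunk_size    => 800,    -- target characters per chunk
        chunk_overlap => 200,    -- characters shared between neighboring chunks
        separators    => array[E'\n## ', E'\n\n', E'\n', ' ']  -- try headers first, then paragraphs
    ),
    embedding => ai.embedding_openai('text-embedding-3-small', 1536)
);
```

The separator list is what makes the splitting "recursive": the splitter tries each separator in order, falling back to the next only when a piece is still too large.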
> ]
> )
Documentation available in the API reference. It's great to have the example recommendation, but I would at least include a link to the reference later.
> For more embedding providers, see the [API Reference documentation](./api-reference.md#embedding-configuration).
>
> ### More Examples
> }'
>
> Note that the assumeRole permission needs you to replace the `projectId/serviceId` with the actual project and service id of your Timescale Cloud installation. You can find this in the Timescale Cloud console. This is a security measure that prevents the [confused deputy problem](https://docs.aws.amazon.com/IAM/latest/UserGuide/confused-deputy.html), which would otherwise allow other Timescale Cloud users to access your buckets if they guessed your role name and accountId.
Suggested change:

> Note that the role trust policy needs you to replace the `projectId/serviceId` with the actual project and service id of your Timescale Cloud installation. You can find this in the Timescale Cloud console. This is a security measure that prevents the [confused deputy problem](https://docs.aws.amazon.com/IAM/latest/UserGuide/confused-deputy.html), which would otherwise allow other Timescale Cloud users to access your buckets if they guessed your role name and accountId.
> #### Grant Permissions to your bucket to the role
Maybe it's not a nit anymore 😅 https://github.com/timescale/pgai/pull/644/files#r2057987453
> ### Syncing S3 to a Documents Table
I have the feeling this belongs more in an example or tutorial than in this guide.
> **2. Embedding API Rate Limits**
>
> If you encounter rate limits with embedding providers:
> - Adjust the processing batch size and concurrency
Add a link to how to do that
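For context, batch size and concurrency are part of the vectorizer's processing configuration. A sketch of what tuning them down might look like (function and parameter names are assumptions; check the pgai API reference):

```sql
-- Hypothetical sketch: lowering throughput to stay under provider rate limits.
-- Function and parameter names are assumptions; see the pgai API reference.
SELECT ai.create_vectorizer(
    'documents'::regclass,
    processing => ai.processing_default(
        batch_size  => 8,   -- fewer items per embedding request
        concurrency => 1    -- one request in flight at a time
    ),
    embedding => ai.embedding_openai('text-embedding-3-small', 1536)
);
```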
> - Ensure S3 bucket names and object keys are correct
>
> **2. Embedding API Rate Limits**
Suggested change:

> **2. Embedding services API Rate Limits**

Or similar. Whatever clarifies that this is out of our responsibility, not a pgai thing.
I think the structure here needs a bit of thinking. For example, I think maybe the S3 stuff should be a separate page. Also, the vectorizer components piece reads too much like a reference.
One other comment: I think this needs an intro at the top that discusses all the issues this doc addresses.
> #### Extended Document Table
>
> For your system, you can include any additional metadata that you might need to filter or classify documents. To facilitate synchronization, consider including `created_at` and `updated_at` updates to these fields will then trigger the re-embedding process:
Suggested change:

> For real applications, you will often want to include additional metadata that you might need to filter or classify documents. To facilitate synchronization, consider including `created_at` and `updated_at`; updates to these fields will then trigger the re-embedding process:
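A sketch of such an extended table. The columns beyond `created_at`/`updated_at` are illustrative assumptions, not from the PR:

```sql
-- Illustrative sketch of an extended document table.
-- Columns other than created_at/updated_at are assumptions.
CREATE TABLE document (
    id         SERIAL PRIMARY KEY,
    title      TEXT NOT NULL,
    uri        TEXT NOT NULL,           -- where the content lives (e.g. an S3 key)
    owner_id   INT,                     -- example application metadata for filtering
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()  -- bump on change to trigger re-embedding
);
```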
> ```sql
> SELECT ai.create_vectorizer(
>     'document'::regclass,
> ```
not sure but including an explicit destination may be good here
> A vectorizer is a declarative configuration that defines how documents are processed, chunked, and embedded. pgai's vectorizer system automatically keeps document embeddings in sync with source documents. You can find the reference for vectorizers in the [API Reference documentation](./api-reference.md).
>
> ### Vectorizer Configuration
Suggested change:

> ### Example Vectorizer Configuration
> ### Vectorizer Components
I agree with @Askir
> ```sql
> -- Basic similarity search
> SELECT d.title, e.chunk, e.embedding <=> <search_embedding> AS distance
> FROM documentation_embedding_store e
> ```
why not use the view instead? (in all of these examples)
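For comparison, the same query against the generated view, which already joins the embedding store with the source table. The view name `documentation_embedding` is an assumption based on pgai's default naming; check the API reference:

```sql
-- Sketch: similarity search via the generated view instead of the store table.
-- The view name is an assumption; pgai generates it alongside the store table.
SELECT title, chunk, embedding <=> <search_embedding> AS distance
FROM documentation_embedding
ORDER BY distance
LIMIT 10;
```

(`<search_embedding>` is the same placeholder used in the excerpt above.)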
> ### Vectorizer Components
This section reads too much like a reference. Instead, it should be an opinionated guide with links to the reference sections in the API.
> The error table includes detailed information about what went wrong.
>
> ## S3 Integration Guide
This is too far down. I think it belongs much higher and needs references from other places.
> ### Syncing S3 to a Documents Table
I wonder if all the s3 stuff belongs in a separate doc?
No description provided.