Skip to content

docs: comprehensive explanation of document vectorization #644

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 28, 2025

Conversation

Askir
Copy link
Contributor

@Askir Askir commented Apr 24, 2025

No description provided.

@Askir Askir marked this pull request as ready for review April 24, 2025 09:35
@Askir Askir requested a review from a team as a code owner April 24, 2025 09:35
Copy link
Contributor

@smoya smoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did half review, but posting my comments here so you can see them in advance.

- Maintains synchronization between documents and their embeddings
- Provides resilient error handling and monitoring

## Setting Up Document Storage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Maybe it is a personal preference, but I really hate "title case". It feels so unnatural to me 🤷


PGAI's document vectorization system addresses these challenges through a declarative approach that handles loading, parsing, chunking, and embedding documents with minimal configuration. This architecture:

- Keeps document metadata in PostgreSQL while document content lives in optimized storage systems
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean "document metadata" here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's explained below quite well, I mean things like created_at, updated_at, owner of docs, or anything else that you want to store about the document that isn't directly in the file itself. We can also call it document application data or something like that?

@Askir Askir force-pushed the jascha/document-vectorizer-docs branch 6 times, most recently from 07462ef to 7f4bbb2 Compare April 24, 2025 12:37

For more embedding providers, see the [API Reference documentation](./api-reference.md#embedding-configuration).

### More Examples
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I link it at the very top of the document do you think I should link it here again?

}'
```

Note that the assumeRole permission needs you to replace the `projectId/serviceId` with the actual project and service id of your Timescale Cloud installation. You can find this in the Timescale Cloud console. This is a security measure that prevents the [confused deputy problem](https://docs.aws.amazon.com/IAM/latest/UserGuide/confused-deputy.html), which would otherwise allow other Timescale Cloud users to access your buckets if they guessed your role name and accountId.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Note that the assumeRole permission needs you to replace the `projectId/serviceId` with the actual project and service id of your Timescale Cloud installation. You can find this in the Timescale Cloud console. This is a security measure that prevents the [confused deputy problem](https://docs.aws.amazon.com/IAM/latest/UserGuide/confused-deputy.html), which would otherwise allow other Timescale Cloud users to access your buckets if they guessed your role name and accountId.
Note that the role trust policy needs you to replace the `projectId/serviceId` with the actual project and service id of your Timescale Cloud installation. You can find this in the Timescale Cloud console. This is a security measure that prevents the [confused deputy problem](https://docs.aws.amazon.com/IAM/latest/UserGuide/confused-deputy.html), which would otherwise allow other Timescale Cloud users to access your buckets if they guessed your role name and accountId.


Note that the assumeRole permission needs you to replace the `projectId/serviceId` with the actual project and service id of your Timescale Cloud installation. You can find this in the Timescale Cloud console. This is a security measure that prevents the [confused deputy problem](https://docs.aws.amazon.com/IAM/latest/UserGuide/confused-deputy.html), which would otherwise allow other Timescale Cloud users to access your buckets if they guessed your role name and accountId.

#### Grant Permissions to your bucket to the role
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

```


### Syncing S3 to a Documents Table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the feeling this belongs more to like an example or tutorial rather than to this guide.

**2. Embedding API Rate Limits**

If you encounter rate limits with embedding providers:
- Adjust the processing batch size and concurrency
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a link to how to do that

- Ensure S3 bucket names and object keys are correct


**2. Embedding API Rate Limits**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
**2. Embedding API Rate Limits**
**2. Embedding services API Rate Limits**

Or similar. Whatever it clarifies this is out of our responsibility, not a pgai thing.

Copy link
Collaborator

@cevian cevian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the structure here need a bit of thinking. For example I think maybe the s3 stuff should be a separate page. Also the vectorizer component piece reads too much like a reference.

One other comment. I think this needs an intro that discusses all the issues this doc addresses at the top.


```sql
SELECT ai.create_vectorizer(
'document'::regclass,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure but including an explicit destination may be good here


A vectorizer is a declarative configuration that defines how documents are processed, chunked, and embedded. pgai's vectorizer system automatically keeps document embeddings in sync with source documents. You can find the reference for vectorizers in the [API Reference documentation](./api-reference.md).

### Vectorizer Configuration
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Vectorizer Configuration
### Example Vectorizer Configuration

```sql
-- Basic similarity search
SELECT d.title, e.chunk, e.embedding <=> <search_embedding> AS distance
FROM documentation_embedding_store e
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use the view instead? (in all of these examples)

4. Splits text into chunks at common markdown breaking points (headers, paragraphs, etc.)
5. Generates embeddings using OpenAI's `text-embedding-3-small` model

### Vectorizer Components
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section reads too much like a reference. Instead it should be an opinionated guide with links to the reference sections in the API


The error table includes detailed information about what went wrong.

## S3 Integration Guide
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too far down I think belongs much higher and needs reference from other places.

```


### Syncing S3 to a Documents Table
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if all the s3 stuff belongs in a separate doc?

@Askir Askir force-pushed the jascha/document-vectorizer-docs branch 4 times, most recently from 6029328 to a8afd5d Compare April 28, 2025 17:23
@Askir Askir temporarily deployed to internal-contributors April 28, 2025 17:26 — with GitHub Actions Inactive
Copy link
Collaborator

@cevian cevian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good enough for now. We'll need to iterate later

@Askir Askir temporarily deployed to internal-contributors April 28, 2025 17:33 — with GitHub Actions Inactive
Askir and others added 4 commits April 28, 2025 19:34
Update docs/vectorizer/document-vectorization.md

Co-authored-by: Sergio Moya <[email protected]>
Signed-off-by: Jascha Beste <[email protected]>

Update docs/vectorizer/document-vectorization.md

Co-authored-by: Sergio Moya <[email protected]>
Signed-off-by: Jascha Beste <[email protected]>

Update docs/vectorizer/document-vectorization.md

Co-authored-by: Sergio Moya <[email protected]>
Signed-off-by: Jascha Beste <[email protected]>

Update docs/vectorizer/document-vectorization.md

Co-authored-by: Sergio Moya <[email protected]>
Signed-off-by: Jascha Beste <[email protected]>

Update docs/vectorizer/document-vectorization.md

Co-authored-by: Sergio Moya <[email protected]>
Signed-off-by: Jascha Beste <[email protected]>

Update docs/vectorizer/document-vectorization.md

Co-authored-by: Matvey Arye <[email protected]>
Signed-off-by: Jascha Beste <[email protected]>
Update docs/vectorizer/s3-documents.md

Co-authored-by: Matvey Arye <[email protected]>
Signed-off-by: Jascha Beste <[email protected]>
@Askir Askir force-pushed the jascha/document-vectorizer-docs branch from 03fda1c to b19b107 Compare April 28, 2025 17:34
@Askir Askir temporarily deployed to internal-contributors April 28, 2025 17:34 — with GitHub Actions Inactive
@Askir Askir temporarily deployed to internal-contributors April 28, 2025 17:44 — with GitHub Actions Inactive
@Askir Askir merged commit 930a6f0 into main Apr 28, 2025
10 checks passed
@Askir Askir deleted the jascha/document-vectorizer-docs branch April 28, 2025 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants