-
Notifications
You must be signed in to change notification settings - Fork 260
docs: comprehensive explanation of document vectorization #644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did half review, but posting my comments here so you can see them in advance.
- Maintains synchronization between documents and their embeddings | ||
- Provides resilient error handling and monitoring | ||
|
||
## Setting Up Document Storage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Maybe it is a personal preference, but I really hate "title case". It feels so unnatural to me 🤷
|
||
PGAI's document vectorization system addresses these challenges through a declarative approach that handles loading, parsing, chunking, and embedding documents with minimal configuration. This architecture: | ||
|
||
- Keeps document metadata in PostgreSQL while document content lives in optimized storage systems |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean "document metadata" here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's explained below quite well, I mean things like created_at, updated_at, owner of docs, or anything else that you want to store about the document that isn't directly in the file itself. We can also call it document application data or something like that?
07462ef
to
7f4bbb2
Compare
|
||
For more embedding providers, see the [API Reference documentation](./api-reference.md#embedding-configuration). | ||
|
||
### More Examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I link it at the very top of the document do you think I should link it here again?
}' | ||
``` | ||
|
||
Note that the assumeRole permission needs you to replace the `projectId/serviceId` with the actual project and service id of your Timescale Cloud installation. You can find this in the Timescale Cloud console. This is a security measure that prevents the [confused deputy problem](https://docs.aws.amazon.com/IAM/latest/UserGuide/confused-deputy.html), which would otherwise allow other Timescale Cloud users to access your buckets if they guessed your role name and accountId. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the assumeRole permission needs you to replace the `projectId/serviceId` with the actual project and service id of your Timescale Cloud installation. You can find this in the Timescale Cloud console. This is a security measure that prevents the [confused deputy problem](https://docs.aws.amazon.com/IAM/latest/UserGuide/confused-deputy.html), which would otherwise allow other Timescale Cloud users to access your buckets if they guessed your role name and accountId. | |
Note that the role trust policy needs you to replace the `projectId/serviceId` with the actual project and service id of your Timescale Cloud installation. You can find this in the Timescale Cloud console. This is a security measure that prevents the [confused deputy problem](https://docs.aws.amazon.com/IAM/latest/UserGuide/confused-deputy.html), which would otherwise allow other Timescale Cloud users to access your buckets if they guessed your role name and accountId. |
|
||
Note that the assumeRole permission needs you to replace the `projectId/serviceId` with the actual project and service id of your Timescale Cloud installation. You can find this in the Timescale Cloud console. This is a security measure that prevents the [confused deputy problem](https://docs.aws.amazon.com/IAM/latest/UserGuide/confused-deputy.html), which would otherwise allow other Timescale Cloud users to access your buckets if they guessed your role name and accountId. | ||
|
||
#### Grant Permissions to your bucket to the role |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe is not a nit anymore 😅 https://github.com/timescale/pgai/pull/644/files#r2057987453
``` | ||
|
||
|
||
### Syncing S3 to a Documents Table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the feeling this belongs more to like an example or tutorial rather than to this guide.
**2. Embedding API Rate Limits** | ||
|
||
If you encounter rate limits with embedding providers: | ||
- Adjust the processing batch size and concurrency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a link to how to do that
- Ensure S3 bucket names and object keys are correct | ||
|
||
|
||
**2. Embedding API Rate Limits** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
**2. Embedding API Rate Limits** | |
**2. Embedding services API Rate Limits** |
Or similar. Whatever it clarifies this is out of our responsibility, not a pgai thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the structure here need a bit of thinking. For example I think maybe the s3 stuff should be a separate page. Also the vectorizer component piece reads too much like a reference.
One other comment. I think this needs an intro that discusses all the issues this doc addresses at the top.
|
||
```sql | ||
SELECT ai.create_vectorizer( | ||
'document'::regclass, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure but including an explicit destination may be good here
|
||
A vectorizer is a declarative configuration that defines how documents are processed, chunked, and embedded. pgai's vectorizer system automatically keeps document embeddings in sync with source documents. You can find the reference for vectorizers in the [API Reference documentation](./api-reference.md). | ||
|
||
### Vectorizer Configuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
### Vectorizer Configuration | |
### Example Vectorizer Configuration |
```sql | ||
-- Basic similarity search | ||
SELECT d.title, e.chunk, e.embedding <=> <search_embedding> AS distance | ||
FROM documentation_embedding_store e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use the view instead? (in all of these examples)
4. Splits text into chunks at common markdown breaking points (headers, paragraphs, etc.) | ||
5. Generates embeddings using OpenAI's `text-embedding-3-small` model | ||
|
||
### Vectorizer Components |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section reads too much like a reference. Instead it should be an opinionated guide with links to the reference sections in the API
|
||
The error table includes detailed information about what went wrong. | ||
|
||
## S3 Integration Guide |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is too far down I think belongs much higher and needs reference from other places.
``` | ||
|
||
|
||
### Syncing S3 to a Documents Table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if all the s3 stuff belongs in a separate doc?
6029328
to
a8afd5d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good enough for now. We'll need to iterate later
Update docs/vectorizer/document-vectorization.md Co-authored-by: Sergio Moya <[email protected]> Signed-off-by: Jascha Beste <[email protected]> Update docs/vectorizer/document-vectorization.md Co-authored-by: Sergio Moya <[email protected]> Signed-off-by: Jascha Beste <[email protected]> Update docs/vectorizer/document-vectorization.md Co-authored-by: Sergio Moya <[email protected]> Signed-off-by: Jascha Beste <[email protected]> Update docs/vectorizer/document-vectorization.md Co-authored-by: Sergio Moya <[email protected]> Signed-off-by: Jascha Beste <[email protected]> Update docs/vectorizer/document-vectorization.md Co-authored-by: Sergio Moya <[email protected]> Signed-off-by: Jascha Beste <[email protected]> Update docs/vectorizer/document-vectorization.md Co-authored-by: Matvey Arye <[email protected]> Signed-off-by: Jascha Beste <[email protected]>
Update docs/vectorizer/s3-documents.md Co-authored-by: Matvey Arye <[email protected]> Signed-off-by: Jascha Beste <[email protected]>
03fda1c
to
b19b107
Compare
No description provided.