# docs: comprehensive explanation of document vectorization #644
base: main
## Conversation
I did half a review, but I'm posting my comments here so you can see them in advance.
> - Maintains synchronization between documents and their embeddings
> - Provides resilient error handling and monitoring
>
> ## Setting Up Document Storage
Nit: Maybe it is a personal preference, but I really hate "title case". It feels so unnatural to me 🤷
> PGAI's document vectorization system addresses these challenges through a declarative approach that handles loading, parsing, chunking, and embedding documents with minimal configuration. This architecture:
>
> - Keeps document metadata in PostgreSQL while document content lives in optimized storage systems
What does "document metadata" mean here?
I think it's explained quite well below. I mean things like `created_at`, `updated_at`, the owner of the doc, or anything else you want to store about the document that isn't directly in the file itself. We could also call it document application data or something like that?
> - Keeps document metadata in PostgreSQL while document content lives in optimized storage systems
> - Automatically processes documents into LLM-friendly formats
> - Maintains synchronization between documents and their embeddings
> - Provides resilient error handling and monitoring
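As a rough illustration of the declarative approach the excerpt describes, a vectorizer over a `documents` table might look like the following sketch. The function and parameter names here are assumptions drawn from pgai's general API shape and should be checked against the API reference:

```sql
-- Hypothetical sketch of a declarative document vectorizer.
-- Function and parameter names are assumptions; see the pgai API reference.
SELECT ai.create_vectorizer(
    'documents'::regclass,
    loading   => ai.loading_uri(column_name => 'uri'),   -- fetch content from S3/HTTP URIs
    parsing   => ai.parsing_auto(),                      -- detect PDF, DOCX, etc. automatically
    chunking  => ai.chunking_recursive_character_text_splitter(),
    embedding => ai.embedding_openai('text-embedding-3-small', 1536)
);
```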
Not sure about the "monitoring" part tbh 🤔 What are the solutions we offer in terms of monitoring here besides logs?
Well we do have the vectorizer errors table that gives a good overview I think. But I also agree, this should/could be shorter.
> 4. Splits text into chunks at common markdown breaking points (headers, paragraphs, etc.)
> 5. Generates embeddings using OpenAI's `text-embedding-3-small` model
>
> ### Vectorizer Components
Would it make sense for this to be located above the "basic vectorizer configuration", so that the latter becomes more like the "config reference"?
I don't think so, the idea is that it shows a full config first and then explains the parts of it rather than the other way around. I think this is a good pattern since ideally most of the API is self-explanatory. The user only has to read on further if they want to understand details or different options then.
I agree with @Askir
3e30a66 to 07462ef (Compare): five commits of "Update docs/vectorizer/document-vectorization.md" (Co-authored-by: Sergio Moya <[email protected]>, Signed-off-by: Jascha Beste <[email protected]>).

07462ef to 7f4bbb2 (Compare).
> - HTTP/HTTPS URLs (e.g. `https://example.com/file.pdf`)
> - Local files on the worker machine (e.g. `/path/to/file.pdf`)
>
> In theory this also supports other source like GCS and other blob storage that's supported by the `smart_open` library. However, this is not supported on Timescale Cloud and if you want to use it yourself in a self-hosted installation, you need to make sure that necessary dependencies are installed. Check the [smart open documentation](https://pypi.org/project/smart-open/) for details.
Some rewording could help make this sentence clearer (at least to me, very subjective).
Suggested change:

> Internally, we use the [smart_open](https://pypi.org/project/smart-open/) library to connect to your configured buckets. That means this feature can connect to Google Cloud Storage and any other blob store supported by `smart_open`; however, this capability isn't available on Timescale Cloud, and we do not officially support it. If you want to enable it in a self-hosted installation, you need to install the appropriate `smart_open` dependencies. See the [smart-open documentation](https://pypi.org/project/smart-open/) for details.
Suggested text:

> We use the `smart_open` library to connect to the URI. That means any URI that works with `smart_open` should work (including Google Cloud, Azure, etc.); however, only AWS S3 is supported on Timescale Cloud. In a self-hosted installation, other providers should work, but you need to install the appropriate `smart_open` dependencies and test it yourself. See the [smart-open documentation](https://pypi.org/project/smart-open/) for details.
> This loads documents directly from binary data stored in a PostgreSQL BYTEA column.
>
> #### Parsing Options
All of these parsing options are already documented in the API reference. I would just link to them instead of rewriting the docs.
> #### Chunking Strategies
>
> Chunking divides documents into smaller pieces for embedding. The recommended approach for documents is recursive character splitting:
Suggested change:

> Chunking divides documents into smaller pieces for embedding. As of today, the recommended chunking approach for documents is recursive character splitting:
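For context, a recursive character splitting configuration might look like the sketch below. The function and parameter names are assumptions based on pgai's general API shape; the authoritative forms are in the API reference:

```sql
-- Hypothetical sketch of a recursive-character chunking configuration.
-- Parameter names are assumptions; see the pgai API reference.
SELECT ai.create_vectorizer(
    'documents'::regclass,
    chunking => ai.chunking_recursive_character_text_splitter(
        chunk_size    => 800,    -- target characters per chunk
        chunk_overlap => 200,    -- characters shared between neighboring chunks
        separators    => array[E'\n## ', E'\n\n', E'\n', ' ']  -- try headers first, then paragraphs
    ),
    embedding => ai.embedding_openai('text-embedding-3-small', 1536)
);
```

The separator list is what makes the splitting "recursive": the splitter tries each separator in order, falling back to the next only when a piece is still too large.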
> ]
> )
Documentation available in the API reference. It's great to have the example recommendation, but I would at least include a link to the reference later.
> For more embedding providers, see the [API Reference documentation](./api-reference.md#embedding-configuration).
>
> ### More Examples
> }'
>
> Note that the assumeRole permission needs you to replace the `projectId/serviceId` with the actual project and service id of your Timescale Cloud installation. You can find this in the Timescale Cloud console. This is a security measure that prevents the [confused deputy problem](https://docs.aws.amazon.com/IAM/latest/UserGuide/confused-deputy.html), which would otherwise allow other Timescale Cloud users to access your buckets if they guessed your role name and accountId.
Suggested change:

> Note that the role trust policy needs you to replace the `projectId/serviceId` with the actual project and service id of your Timescale Cloud installation. You can find this in the Timescale Cloud console. This is a security measure that prevents the [confused deputy problem](https://docs.aws.amazon.com/IAM/latest/UserGuide/confused-deputy.html), which would otherwise allow other Timescale Cloud users to access your buckets if they guessed your role name and accountId.
> #### Grant Permissions to your bucket to the role
Maybe it's not a nit anymore 😅 https://github.com/timescale/pgai/pull/644/files#r2057987453
> ### Syncing S3 to a Documents Table
I have the feeling this belongs more in an example or tutorial than in this guide.
> **2. Embedding API Rate Limits**
>
> If you encounter rate limits with embedding providers:
> - Adjust the processing batch size and concurrency
Add a link to how to do that
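For context, batch size and concurrency are part of the vectorizer's processing configuration. A sketch of what tuning them down might look like (function and parameter names are assumptions; check the pgai API reference):

```sql
-- Hypothetical sketch: lowering throughput to stay under provider rate limits.
-- Function and parameter names are assumptions; see the pgai API reference.
SELECT ai.create_vectorizer(
    'documents'::regclass,
    processing => ai.processing_default(
        batch_size  => 8,   -- fewer items per embedding request
        concurrency => 1    -- one request in flight at a time
    ),
    embedding => ai.embedding_openai('text-embedding-3-small', 1536)
);
```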
> - Ensure S3 bucket names and object keys are correct
>
> **2. Embedding API Rate Limits**
Suggested change:

> **2. Embedding services API Rate Limits**

Or similar. Whatever clarifies that this is out of our responsibility, not a pgai thing.
I think the structure here needs a bit of thinking. For example, I think maybe the S3 stuff should be a separate page. Also, the vectorizer components piece reads too much like a reference.
One other comment: I think this needs an intro at the top that discusses all the issues this doc addresses.
> #### Extended Document Table
>
> For your system, you can include any additional metadata that you might need to filter or classify documents. To facilitate synchronization, consider including `created_at` and `updated_at` updates to these fields will then trigger the re-embedding process:
Suggested change:

> For real applications, you will often want to include additional metadata that you might need to filter or classify documents. To facilitate synchronization, consider including `created_at` and `updated_at`; updates to these fields will then trigger the re-embedding process:
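A sketch of such an extended table. The columns beyond `created_at`/`updated_at` are illustrative assumptions, not from the PR:

```sql
-- Illustrative sketch of an extended document table.
-- Columns other than created_at/updated_at are assumptions.
CREATE TABLE document (
    id         SERIAL PRIMARY KEY,
    title      TEXT NOT NULL,
    uri        TEXT NOT NULL,           -- where the content lives (e.g. an S3 key)
    owner_id   INT,                     -- example application metadata for filtering
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()  -- bump on change to trigger re-embedding
);
```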
> ```sql
> SELECT ai.create_vectorizer(
>     'document'::regclass,
> ```
not sure but including an explicit destination may be good here
> A vectorizer is a declarative configuration that defines how documents are processed, chunked, and embedded. pgai's vectorizer system automatically keeps document embeddings in sync with source documents. You can find the reference for vectorizers in the [API Reference documentation](./api-reference.md).
>
> ### Vectorizer Configuration
Suggested change:

> ### Example Vectorizer Configuration
> ### Vectorizer Components
I agree with @Askir
> ```sql
> -- Basic similarity search
> SELECT d.title, e.chunk, e.embedding <=> <search_embedding> AS distance
> FROM documentation_embedding_store e
> ```
why not use the view instead? (in all of these examples)
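For comparison, the same query against the generated view, which already joins the embedding store with the source table. The view name `documentation_embedding` is an assumption based on pgai's default naming; check the API reference:

```sql
-- Sketch: similarity search via the generated view instead of the store table.
-- The view name is an assumption; pgai generates it alongside the store table.
SELECT title, chunk, embedding <=> <search_embedding> AS distance
FROM documentation_embedding
ORDER BY distance
LIMIT 10;
```

(`<search_embedding>` is the same placeholder used in the excerpt above.)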
> ### Vectorizer Components
This section reads too much like a reference. Instead, it should be an opinionated guide with links to the reference sections in the API.
> The error table includes detailed information about what went wrong.
>
> ## S3 Integration Guide
This is too far down. I think it belongs much higher and needs references from other places.
> ### Syncing S3 to a Documents Table
I wonder if all the s3 stuff belongs in a separate doc?
No description provided.