Building a Vectra based Document Index #8

Stevenic (Maintainer) · 2023-06-22T08:17:01Z

Just sharing some ideas for how to use Vectra to build a full-fledged document index.

Yes, you could just break each document into chunks and add all of the individual chunks to a single Vectra index that aggregates every document. My specific goal, though, is to first find the most relevant documents semantically and then find the relevant parts of each individual document to embed in a prompt.

You need multiple indexes for this, but since they’re all local they’re fast and free :)

The core idea is that every document is first added to its own local Vectra index. Then the document gets added to a master index which aggregates all documents. The goal of the master index is to identify the documents that might best contain the answer. There are a couple of ways this could work.
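To make the two-stage lookup concrete, here is a minimal in-memory sketch. It does not use Vectra's API; the types and function names (`MasterEntry`, `DocIndex`, `findDocuments`, `findChunks`) are hypothetical stand-ins, and plain arrays take the place of the on-disk indexes.

```typescript
// Hypothetical stand-ins for the master index and the per-document indexes.
type Vector = number[];

interface Chunk { text: string; vector: Vector; }
interface DocIndex { docId: string; chunks: Chunk[]; }
interface MasterEntry { docId: string; vector: Vector; }

// Cosine similarity between two vectors of equal length.
function cosine(a: Vector, b: Vector): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Stage 1: score each document by its best-matching master entry
// and return the ids of the highest-scoring documents.
function findDocuments(master: MasterEntry[], query: Vector, topDocs: number): string[] {
  const best: Record<string, number> = {};
  for (const entry of master) {
    const score = cosine(entry.vector, query);
    if (!(entry.docId in best) || score > best[entry.docId]) {
      best[entry.docId] = score;
    }
  }
  return Object.keys(best)
    .sort((x, y) => best[y] - best[x])
    .slice(0, topDocs);
}

// Stage 2: rank the chunks inside one document's own index.
function findChunks(doc: DocIndex, query: Vector, topChunks: number): Chunk[] {
  return doc.chunks
    .map(chunk => ({ chunk, score: cosine(chunk.vector, query) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topChunks)
    .map(scored => scored.chunk);
}
```

A query would call `findDocuments` against the master entries first, then `findChunks` only on the handful of documents it returns, so most per-document indexes never have to be loaded.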

You could just add all of the individual chunks to the master index but that’s going to take up a lot of space and I’m questioning whether it actually buys you anything…

The alternative is to take the first 5 chunks of every document and add those to the master index. That limits the number of chunks held in memory for any one document, and I’d go a step further and not save any metadata for this master index at all, since it will never be used.
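A rough sketch of that capped, embeddings-only master entry might look like the following. The `buildMasterEntries` helper and the `MasterEntry` shape are hypothetical, not part of Vectra, and the sketch assumes a document id is kept next to each vector so that a hit can be traced back to its document.

```typescript
// Hypothetical master-index entry: a vector plus the id of its document.
// No chunk text or other metadata is stored.
interface MasterEntry { docId: string; vector: number[]; }

// Keep only the first `maxChunks` chunk embeddings of a document.
function buildMasterEntries(
  docId: string,
  chunkVectors: number[][],
  maxChunks: number = 5
): MasterEntry[] {
  return chunkVectors
    .slice(0, maxChunks)
    .map(vector => ({ docId, vector }));
}
```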

ReneReiterer · 2023-06-22T12:15:42Z

You could also write a function that compares all chunks of a document and gives you back the 5 chunks that are least similar to each other. That way, the master index entry for that document would carry the most information about the document in the least space.
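One way to approximate "the 5 least similar chunks" is a greedy farthest-point pass: seed with the first chunk, then repeatedly add the chunk whose closest already-picked neighbour is least similar. This is a heuristic sketch rather than an exact solution, and `pickDiverseChunks` is a hypothetical helper, not something Vectra provides.

```typescript
// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Greedy selection of up to `count` mutually dissimilar chunks.
// Returns the indices of the chosen chunk vectors, in pick order.
function pickDiverseChunks(vectors: number[][], count: number): number[] {
  if (vectors.length === 0 || count <= 0) return [];
  const selected: number[] = [0]; // assumption: seed with the first chunk
  const target = Math.min(count, vectors.length);
  while (selected.length < target) {
    let bestIndex = -1;
    let bestScore = Infinity;
    for (let i = 0; i < vectors.length; i++) {
      if (selected.indexOf(i) !== -1) continue;
      // Similarity to the closest chunk that was already picked.
      let closest = -Infinity;
      for (const s of selected) {
        closest = Math.max(closest, cosine(vectors[i], vectors[s]));
      }
      // Prefer the candidate that is farthest from everything picked so far.
      if (closest < bestScore) {
        bestScore = closest;
        bestIndex = i;
      }
    }
    selected.push(bestIndex);
  }
  return selected;
}
```

Each round compares every remaining chunk against every picked chunk, which is cheap for the chunk counts of a single document.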

You could also theoretically remove certain function words like "the", "or", "and", "is", "a", etc. That way, the master index would be even smaller while still containing the important information.
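A minimal sketch of that word filter, run on the chunk text before it is embedded, could look like this. The word list contains only the examples named above, and `stripFunctionWords` is a hypothetical helper.

```typescript
// Illustrative list only: the five words named in the comment above.
const FUNCTION_WORDS: string[] = ["the", "or", "and", "is", "a"];

// Drop function words from a chunk of text before it is embedded.
// Words are compared lower-cased with surrounding punctuation ignored.
function stripFunctionWords(text: string): string {
  return text
    .split(/\s+/)
    .filter(word => word.length > 0)
    .filter(word => {
      const bare = word.toLowerCase().replace(/[^a-z']/g, "");
      return FUNCTION_WORDS.indexOf(bare) === -1;
    })
    .join(" ");
}
```

Note that this shortens the text that gets embedded, not the resulting vector: an embedding model returns the same number of dimensions regardless of input length.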

4 replies

Stevenic Jun 22, 2023
Maintainer Author

> You could also write a function that compares all chunks of a document and gives you back the 5 chunks that are least similar to each other. That way, the master index entry for that document would carry the most information about the document in the least space.

That's a great idea...

> You could also theoretically remove certain function words like "the", "or", "and", "is", "a", etc. That way, the master index would be even smaller while still containing the important information.

The master index should only ever contain embeddings, so if each entry in the master index is made up of 5 chunks, that's 5 arrays of 1536 numbers each.
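For a sense of scale, the arithmetic can be sketched as below. The helper names are hypothetical, and the 8 bytes per number is an assumption (a raw 64-bit float); the size of the same numbers written out as JSON text on disk will differ.

```typescript
// How many numbers the master index holds for a given corpus size.
function masterIndexNumbers(
  docCount: number,
  chunksPerDoc: number = 5,
  dimensions: number = 1536
): number {
  return docCount * chunksPerDoc * dimensions;
}

// Rough in-memory size, assuming a fixed byte width per number.
function approxBytes(numbers: number, bytesPerNumber: number = 8): number {
  return numbers * bytesPerNumber;
}

// One document: 5 * 1536 = 7680 numbers, or 61440 bytes at 8 bytes each.
const perDocument = approxBytes(masterIndexNumbers(1));
```

Because the vector length is fixed, the per-document cost depends only on how many chunks are kept, not on how long each chunk's text is.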

ReneReiterer Jun 23, 2023

> The master index should only ever contain embeddings, so if each entry in the master index is made up of 5 chunks, that's 5 arrays of 1536 numbers each.

Yeah, I get that. I was thinking more along the lines of removing the function words from the chunks before you create the embeddings for the master index, so that the embeddings, and therefore the master index, would be smaller. But I don't know if that's even going to do anything for performance.

Stevenic Jun 23, 2023
Maintainer Author

No, it will just reduce the memory footprint. There are some clustering strategies you could use to reduce the search space, but they're definitely beyond the scope of Vectra.

Stevenic Jun 23, 2023
Maintainer Author

I intended Vectra to just be a fast and inexpensive (free) vector DB for small document corpuses. Using partitioning, I think you could scale it to medium-sized corpuses, but I’m not trying to compete with hosted services like Pinecone.

Stevenic (Maintainer, Author) · 2023-09-20T15:39:52Z

As an FYI... Vectra now has a full-fledged document index and a CLI for ingesting documents. The new LocalDocumentIndexer class can be used to manage a catalog of documents. You can even query the documents using plain-text queries, and documents can have their own metadata and be filtered like chunks.

0 replies
