
arroy for vectors #2074


Open · wants to merge 32 commits into base master

Conversation

@ricopinazo ricopinazo commented May 2, 2025

What changes were proposed in this pull request?

Why are the changes needed?

The old approach of defining custom document search queries in Python had two problems:

  1. It used the Python host as the runtime, leading to potential deadlocks and similar issues
  2. It loaded all the vectorised graphs into the server, which didn't scale

Does this PR introduce any user-facing change? If yes, is this documented?

How was this patch tested?

Are there any further changes required?

  • remove heed from deps
  • try to remove all the duplication for translating between u32 and entity id
  • double-check having max lmdb size set to 1TB doesn't cause the disk usage to grow for small graphs
  • have a separate module for the db abstraction
  • keep adding benchmarks, use different query for each call in the current one
  • stop loading all graphs for the global plugins
  • Reuse EntityRef for DocumentEntity
  • re-enable save_embeddings function
  • move GqlDocument under graph folder
  • add window parameters to vector selection and vectorised graph functions in GraphQL
  • review if I'm actually using the async_stream crate and the others
  • double-check Ctrl+C still works
  • decide proper size for the heed env
  • test the miss rate in a real-world scenario: 0.02% misses in a 100k dataset
  • remove all unwraps from cache.rs
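One recurring item in the list above is the u32 ↔ entity id translation: arroy, like most ANN stores, addresses items by dense u32 ids, so the graph's entity references have to be mapped in both directions in a single place rather than duplicated at every call site. A minimal sketch of such a map (all names hypothetical rather than the PR's actual types, beyond the plan to reuse `EntityRef`):

```rust
use std::collections::HashMap;

// Hypothetical entity reference; the PR plans to reuse `EntityRef` here.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum EntityRef {
    Node(u64),
    Edge(u64),
}

// Single owner of the u32 <-> entity translation, so callers never
// duplicate the conversion logic.
#[derive(Default)]
struct IdMap {
    to_u32: HashMap<EntityRef, u32>,
    from_u32: Vec<EntityRef>,
}

impl IdMap {
    // Returns the existing id, or assigns the next dense u32.
    fn get_or_insert(&mut self, entity: EntityRef) -> u32 {
        if let Some(&id) = self.to_u32.get(&entity) {
            return id;
        }
        let id = self.from_u32.len() as u32;
        self.from_u32.push(entity);
        self.to_u32.insert(entity, id);
        id
    }

    // Translates an arroy item id back to the graph entity, if known.
    fn resolve(&self, id: u32) -> Option<EntityRef> {
        self.from_u32.get(id as usize).copied()
    }
}
```

Keeping both directions in one struct makes it hard for the forward and reverse mappings to drift apart.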

"embedding": [1.0, 2.0],
},
],
},
"vectorisedGraph": {

We probably want to move this api inside of graph -- which would fail if the graphs aren't vectorised

selection.expand_entities_by_similarity("node1", 100, (20, 100))
contents = [doc.content for doc in selection.get_documents()]
assert contents == ["node1", "edge1", "node2", "edge2", "node3"]
assert contents == ["node1", "edge1", "node2"]

did these have to be deleted?

@@ -229,6 +206,7 @@ def test_default_template():

vg = g.vectorise(constant_embedding)

node_docs = vg.entities_by_similarity(query="whatever", limit=10).get_documents()

can be removed?

@@ -27,6 +27,8 @@ use pyo3::{
types::{PyFunction, PyList},
};

type DynamicVectorisedGraph = VectorisedGraph<DynamicGraph>;

We want the Graph storage to have a vectorstore inside of it in the same manner that it has an index
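The suggestion above is for graph storage to own an optional vector store the same way it owns an index, with vector queries failing fast on graphs that were never vectorised instead of loading everything into the server. A minimal sketch of that shape (all type names hypothetical):

```rust
// Hypothetical stand-ins for the real index and vector store types.
struct GraphIndex;
struct VectorStore;

struct GraphStorage {
    // Present once an index has been built.
    index: Option<GraphIndex>,
    // Present only once the graph has been vectorised.
    vectors: Option<VectorStore>,
}

impl GraphStorage {
    fn new() -> Self {
        GraphStorage { index: None, vectors: None }
    }

    // Vector queries fail fast if the graph was never vectorised.
    fn vectors(&self) -> Result<&VectorStore, String> {
        self.vectors
            .as_ref()
            .ok_or_else(|| "graph has not been vectorised".to_string())
    }
}
```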


use super::embeddings::EmbeddingFunction;

const MAX_DISK_ITEMS: usize = 1_000_000;

For next PR - can we make these configurable?
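One way to make these limits configurable in the follow-up PR is a config struct whose defaults are the constants currently hard-coded in the module. A sketch under that assumption (struct and field names hypothetical):

```rust
// Hypothetical configuration for the vector cache / LMDB environment,
// defaulting to the values currently hard-coded as constants.
struct VectorStoreConfig {
    max_disk_items: usize,
    map_size_bytes: usize, // LMDB env map size
}

impl Default for VectorStoreConfig {
    fn default() -> Self {
        VectorStoreConfig {
            max_disk_items: 1_000_000,
            map_size_bytes: 1 << 40, // 1 TiB, matching the 1TB mentioned in the checklist
        }
    }
}
```

Callers that don't care keep `VectorStoreConfig::default()`; tests and constrained deployments can override either field.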

&self,
ctx: &Context<'_>,
query: String,
limit: usize,

We need to revamp how algorithms work in graphql

.unwrap_or(Arc::new(None));

GraphWithVectors::read_from_folder(folder, embedding, cache, self.create_index)
.unwrap_or_else(|| VectorCache::in_memory(openai_embedding)); // TODO: review, this is weird...

Wanna have another look at this?

window: Option<Window>,
) -> GraphResult<GqlVectorSelection> {
let vector = ctx.embed_query(query).await?;
let w = window.into_window_tuple();

Can we stick some spawn blockings around these


#[ResolvedObjectFields]
impl GqlVectorSelection {
async fn nodes(&self) -> Vec<Node> {

I think most of these will need a spawn blocking

window: Option<Window>,
) -> GraphResult<GqlVectorSelection> {
let vector = ctx.embed_query(query).await?;
let w = window.into_window_tuple();

Note here for us to add spawn blocking when we redo algorithms
