Fast retrieval of 8k vectors of dim 1024 #2005

ExtReMLapin · 2025-02-20T07:47:35Z

Hello,
We recently moved from an H5 file to ArcadeDB to leverage its database safety features (ACID, etc.).

We now store our embeddings in ArcadeDB; each embedding vertex is linked by an edge to the node it embeds.

The issue is that retrieval at boot of our software is quite slow.

Cypher query

match (vector:EMBEDDING)-[:embb]->(targetNode)
        return ID(targetNode) as rid, vector as vector

Takes 21s for 7692 entries

SQL query

MATCH {type: EMBEDDING, as: embb}-->{ as: target}
RETURN embb.vector, target.asRID()

Takes 21s for 7692 entries

Profiling the Cypher query returns:

Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
ArcadeFilterByTypeStep(vertex,EMBEDDING)                            7692        7692           2,799     1,00
HasStep@[vector]                                                    7692        7692           0,994     0,35
VertexStep(OUT,[embb],vertex)@[targetNode]                          7692        7692          10,008     3,57
SelectStep(last,[targetNode, vector])                               7692        7692           2,538     0,90
NoOpBarrierStep(2500)                                               7692        7692           4,808     1,71
ProjectStep([rid, vector],[[SelectOneStep(last,...                  7692        7692         259,488    92,46
  SelectOneStep(last,targetNode,null)                               7692        7692           1,589
  IdStep                                                            7692        7692           1,269
  SelectOneStep(last,vector,null)                                   7692        7692           1,380
  ChooseStep([PropertiesStep([vector],value), H...                  7692        7692         234,318
    PropertiesStep([vector],value)                                  7692        7692         116,185
    HasNextStep                                                     7692        7692           1,100
    PropertiesStep([vector],value)                                  7692        7692         106,749
    EndStep                                                         7692        7692           1,273
    ConstantStep(null)                                                                         0,000
    EndStep                                                                                    0,000
                                            >TOTAL                     -           -         280,637        -

Returning the vector RID instead of the vector itself obviously speeds up the query a lot (it takes less than 1 s), which leads me to believe that returning the 8k vectors of dim 1024 is what slows the whole query down.

The vector property was created as an ARRAY_OF_FLOATS.


lvca · 2025-02-20T14:48:34Z
Maintainer

Maybe the array serializer is not efficient and the cost is just in serializing the result.

Also, if you store large records, it could help to change the page size from 65k to 2x or 4x that.

Could you please provide a test case, or even a database with similar data, so I can run some tests locally?

21 replies

lvca Feb 20, 2025
Maintainer

OK, I ran MATCH {type: EMBEDDING, as: embb}-->{ as: target} RETURN embb.vector, target.asRID() and I see it spends a lot of time serializing large arrays into JSON. I don't think it's a performance problem in the engine; it's just JSON/HTTP overhead.

In your use case, are you using a driver or a Java app?

ExtReMLapin Feb 21, 2025
Author

We are only using the REST API.

ExtReMLapin Feb 21, 2025
Author

Also, we're storing the vectors in ArcadeDB only for the ACID features; we retrieve all of them to do GPU cosine similarity, clustering, etc.
From my understanding, using vector indexes over REST is a little complicated right now.

Also, unless I misunderstood your message, JSON serialization should not be that slow.
In Python (which has no JIT compiler, unlike Java), once the data is fetched, converting it to NumPy and back to a JSON string takes less than two seconds:

Getting vectors for DB ORANO_DOC took 15.355653762817383s
Converting to numpy took 0.11538863182067871s
Converting back to json took 1.7384111881256104s
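
To see where the time in the first line goes, the HTTP round trip and the client-side JSON decoding can be timed separately. The following is only a minimal sketch: the endpoint path `/api/v1/query/{database}`, the `{"language", "command"}` request body and the `{"result": [...]}` response envelope are assumptions about the ArcadeDB HTTP API, and `build_query_request` and `timed_fetch` are hypothetical helper names.

```python
import base64
import json
import time
import urllib.request


def build_query_request(base_url, database, command, user, password, language="sql"):
    """Build a POST request for the (assumed) ArcadeDB HTTP query endpoint."""
    body = json.dumps({"language": language, "command": command}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{base_url}/api/v1/query/{database}",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def timed_fetch(request, opener=urllib.request.urlopen):
    """Return (rows, timings), timing the HTTP wait and the JSON decode separately."""
    t0 = time.perf_counter()
    with opener(request) as response:
        raw = response.read()  # covers server execution, serialization and transfer
    t1 = time.perf_counter()
    rows = json.loads(raw)["result"]  # assumed envelope: {"result": [...]}
    t2 = time.perf_counter()
    timings = {"http_s": t1 - t0, "json_decode_s": t2 - t1, "bytes": len(raw)}
    return rows, timings
```

If `http_s` dominates while `json_decode_s` stays small, the cost is on the server or the wire rather than in the Python client.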

lvca Feb 22, 2025
Maintainer

Are these times using ArcadeDB from Python? Or is it just Python code converting arrays <-> JSON?

JSON is not the most optimized format for transferring arrays of numbers. They have to be converted to strings and back. Have you tried using the Postgres driver instead?
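
To get a rough feel for the size difference, the sketch below compares a JSON text encoding with a packed 4-byte float encoding on random data. It measures a small sample and scales linearly to the 7692 x 1024 result set from this thread. The data is random, not the real embeddings, and Python prints doubles with up to 17 significant digits, so the JSON figure probably overstates what a float32-aware serializer would emit.

```python
import array
import json
import random


def payload_sizes(n_vectors, dim, seed=0):
    """Return (json_bytes, float32_bytes) for n_vectors random vectors of length dim."""
    rng = random.Random(seed)
    vectors = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_vectors)]
    json_bytes = len(json.dumps(vectors).encode("utf-8"))
    # array("f") packs each value as a 4-byte C float on common platforms
    float32_bytes = sum(len(array.array("f", v).tobytes()) for v in vectors)
    return json_bytes, float32_bytes


# Measure a small sample, then scale linearly to the full result set.
sample = 100
json_b, bin_b = payload_sizes(n_vectors=sample, dim=1024)
scale = 7692 / sample
print(f"JSON    ~{json_b * scale / 1e6:.0f} MB")
print(f"float32 ~{bin_b * scale / 1e6:.0f} MB")
```

For reference, the packed form of 7692 x 1024 four-byte floats is 31,506,432 bytes, so the binary payload itself is about 31.5 MB.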

Answer selected by ExtReMLapin

ExtReMLapin Feb 22, 2025
Author

"Getting vectors" (first line) used the REST API from Python and included converting the JSON to Python structures.

Line 3 was the Python -> JSON conversion. It takes some time, but we're not waiting 10 seconds for it. This is why I find it surprising when you say the time is mostly related to serialization.

At the office, we can give Postgres a try on Monday.

lvca Feb 22, 2025
Maintainer

I could profile the query next week to look for bottlenecks.

In your case, wouldn't it be faster to process the vectors in the ArcadeDB server? What about writing a simple plugin (in Java) or a JavaScript function that queries and processes them locally?

Not sure if that makes sense.

ExtReMLapin Feb 22, 2025
Author

I totally agree the best would be to use the vectors directly inside ArcadeDB and return only the results (RIDs) we need on demand.

To stick with our "clustering" methodology, we could still pick the top 100 closest values and perform clustering on them afterwards, that's true.
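
As an illustration of the "top 100 closest" idea, here is a minimal pure-Python sketch of cosine-similarity ranking. It is not how ArcadeDB's vector index works and not our GPU pipeline; `top_k_closest` and the RID-to-vector mapping are hypothetical names used only for this example.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_closest(query, embeddings, k=100):
    """embeddings maps RID -> vector. Returns [(rid, score)], best match first."""
    scored = ((cosine_similarity(query, vec), rid) for rid, vec in embeddings.items())
    return [(rid, score) for score, rid in heapq.nlargest(k, scored)]
```

Clustering would then run only on the vectors of the returned RIDs instead of on all 8k embeddings.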

We never really managed to get vectors to work in ArcadeDB (we didn't spend much time trying, to be honest). I only recently gave it another try, but I ran into multiple issues (including #1999).

A year ago, when we failed to get vectors to work, we just stored our embedding vectors in a raw file (PKL), and a few months ago we moved to the H5 file format to be able to save on the fly. We recently ran into corruption issues, which is what made us move to storing our vectors in ArcadeDB.

ExtReMLapin Feb 27, 2025
Author

Sorry for the delayed answer, this week has been a train wreck at the office.

I could not get the Postgres driver to work; I got the same error as you mentioned here: #399 (comment)

Edit: never mind, using the other package (pip install psycopg) fixes the issue.
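
For anyone trying the same route, this is roughly what fetching rows through the Postgres wire protocol with psycopg (version 3) could look like. Treat it as a sketch under assumptions: the port 5432, `sslmode="disable"` and enabling autocommit are guesses about a typical ArcadeDB Postgres plugin setup, `connection_params` and `fetch_rows` are hypothetical helpers, and how ARRAY_OF_FLOATS values are returned over this protocol has not been checked here.

```python
def connection_params(database, user, password, host="localhost", port=5432):
    """Keyword arguments for psycopg.connect (values are assumptions, adjust as needed)."""
    return {
        "host": host,
        "port": port,
        "dbname": database,
        "user": user,
        "password": password,
        "sslmode": "disable",  # assumption: no TLS on the Postgres plugin
    }


def fetch_rows(params, query, connect=None):
    """Run a query and return all rows as a list of tuples."""
    if connect is None:
        import psycopg  # installed with: pip install psycopg

        connect = psycopg.connect
    with connect(**params) as conn:
        conn.autocommit = True  # assumption: avoid explicit transaction handling
        with conn.cursor() as cur:
            cur.execute(query)
            return cur.fetchall()
```

Usage would be along the lines of `fetch_rows(connection_params("mydb", "root", "secret"), "SELECT FROM EMBEDDING LIMIT 10")`, where the database name and credentials are placeholders.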
