Skip to content

Add python binding for vectorstore #573

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 22 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 3 additions & 4 deletions gel/_testbase.py
Original file line number Diff line number Diff line change
Expand Up @@ -178,12 +178,12 @@ def _start_cluster(*, cleanup_atexit=True):
if line.startswith(b'READY='):
break
else:
raise RuntimeError('not ready')
raise RuntimeError("not ready")
break
except Exception:
time.sleep(1)
else:
raise RuntimeError('server status file not found')
raise RuntimeError("server status file not found")

data = json.loads(line.split(b'READY=')[1])
con_args = dict(host='localhost', port=data['port'])
Expand All @@ -203,8 +203,7 @@ def _start_cluster(*, cleanup_atexit=True):
]
)
else:
con_args['tls_ca_file'] = data['tls_cert_file']

con_args["tls_ca_file"] = data["tls_cert_file"]
client = gel.create_client(password='test', **con_args)
client.ensure_connected()
client.execute("""
Expand Down
30 changes: 28 additions & 2 deletions gel/ai/__init__.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
#
# This source file is part of the EdgeDB open source project.
# This source file is part of the Gel open source project.
#
# Copyright 2024-present MagicStack Inc. and the EdgeDB authors.
# Copyright 2024-present MagicStack Inc. and the Gel authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
Expand All @@ -19,6 +19,21 @@
from .types import RAGOptions, ChatParticipantRole, Prompt, QueryContext
from .core import create_rag_client, RAGClient
from .core import create_async_rag_client, AsyncRAGClient
from .vectorstore import (
create_vstore,
create_async_vstore,
Record,
Vector,
SearchResult,
EmbeddingModel,
AsyncEmbeddingModel,
)
from .metadata_filter import (
MetadataFilter,
CompositeFilter,
FilterOperator,
FilterCondition,
)

__all__ = [
"RAGOptions",
Expand All @@ -29,4 +44,15 @@
"RAGClient",
"create_async_rag_client",
"AsyncRAGClient",
"create_vstore",
"create_async_vstore",
"Record",
"Vector",
"SearchResult",
"EmbeddingModel",
"AsyncEmbeddingModel",
"MetadataFilter",
"CompositeFilter",
"FilterOperator",
"FilterCondition",
]
134 changes: 134 additions & 0 deletions gel/ai/metadata_filter.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I don't like this query filter builder API - it's neither easier nor clearer than writing raw EdgeQL segment in a string. And this deviates from the JS-style query builder, at the same time, it's not Pythonic. Why did we want to add a filter API like this in the first place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it looks a bit complex, this is how LamaIndex did it, I think. @deepbuzin

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elprans can you pls take a look at this too

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we want to add a filter API like this in the first place?

Same global reason as for building the entire vectorstore - to enable users to use Gel in LLM-related applications and not have to learn EdgeQL (or anything really). It's meant to be dumb and familiar enough to justify using it in place of other vectorstores.

We spent some time discussing this way of implementing filters on a call about a month ago (FWIW I'm not thrilled with it either), and couldn't think of a better way. That's not to say there isn't one, but we had to pick something and move on.

Copy link

@anbuzin anbuzin Feb 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although now that I think about it, maybe raw EdgeQL in our own binding isn’t the worst idea… If someone wants a LlamaIndex-style filter constructor they might as well use the LlamaIndex binding.

But downsides of the raw string are 1. no help from LSP and 2. EdgeQL for filters is not very pretty or concise with all the json fiddling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I see. I've also left some comments in my proposal, but feel free to continue with what the team had agreed on.

Original file line number Diff line number Diff line change
@@ -0,0 +1,134 @@
from __future__ import annotations
from dataclasses import dataclass
from typing import Union, List
from enum import Enum


class FilterOperator(str, Enum):
EQ = "="
NE = "!="
GT = ">"
LT = "<"
GTE = ">="
LTE = "<="
IN = "in"
NOT_IN = "not in"
LIKE = "like"
ILIKE = "ilike"
ANY = "any"
ALL = "all"
CONTAINS = "contains"
EXISTS = "exists"


class FilterCondition(str, Enum):
AND = "and"
OR = "or"


@dataclass
class MetadataFilter:
"""Represents a single metadata filter condition."""

key: str
value: Union[int, float, str]
operator: FilterOperator = FilterOperator.EQ

def __repr__(self):
value = f'"{self.value}"' if isinstance(self.value, str) else self.value
return (
f'MetadataFilter(key="{self.key}", '
f"value={value}, "
f'operator="{self.operator.value}")'
)


@dataclass
class CompositeFilter:
"""
Allows grouping multiple MetadataFilter instances using AND/OR.
"""

filters: List[Union[CompositeFilter, MetadataFilter]]
condition: FilterCondition = FilterCondition.AND

def __repr__(self):
return (
f'CompositeFilter(condition="{self.condition.value}", '
f"filters={self.filters})"
)


def get_filter_clause(filters: CompositeFilter) -> str:
"""
Get the filter clause for a given CompositeFilter.
"""

subclauses = []
for filter in filters.filters:
subclause = ""

if isinstance(filter, CompositeFilter):
subclause = get_filter_clause(filter)
elif isinstance(filter, MetadataFilter):
formatted_value = (
f'"{filter.value}"'
if isinstance(filter.value, str)
else filter.value
)

# Simple comparison operators
if filter.operator in {
FilterOperator.EQ,
FilterOperator.NE,
FilterOperator.GT,
FilterOperator.GTE,
FilterOperator.LT,
FilterOperator.LTE,
FilterOperator.LIKE,
FilterOperator.ILIKE,
}:
subclause = (
f'<typeof {formatted_value}>json_get(.metadata, "{filter.key}") '
f"{filter.operator.value} {formatted_value}"
)
# casting should be fixed
elif filter.operator in {FilterOperator.IN, FilterOperator.NOT_IN}:
subclause = (
f"{formatted_value} "
f"{filter.operator.value} "
f'<str>json_get(.metadata, "{filter.key}")'
)
# casting should be fixed
# works only for equality, should be updated to support the rest,
# example: select all({1, 2, 3, 4} < 4);
elif filter.operator in {FilterOperator.ANY, FilterOperator.ALL}:
subclause = (
f"{filter.operator.value} ("
f'<str>json_get(.metadata, "{filter.key}")'
f" = {formatted_value})"
)

elif filter.operator == FilterOperator.EXISTS:
subclause = f'exists <str>json_get(.metadata, "{filter.key}")'

# casting should be fixed
# edgeql contains supports different types like range etc which
# we don't support here
elif filter.operator == FilterOperator.CONTAINS:
subclause = (
f'contains (<str>json_get(.metadata, "{filter.key}"), '
f"{formatted_value})"
)
else:
raise ValueError(f"Unknown operator: {filter.operator}")

subclauses.append(subclause)

if filters.condition in {FilterCondition.AND, FilterCondition.OR}:
filter_clause = f" {filters.condition.value} ".join(subclauses)
return (
"(" + filter_clause + ")" if len(subclauses) > 1 else filter_clause
)
else:
raise ValueError(f"Unknown condition: {filters.condition}")
Loading
Loading