Harden vector store query filter construction#21899
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Hardens multiple vector store implementations against injection-style issues by parameterizing/escaping ref_doc_id (and other filter inputs) and adding regression tests to validate the safer behavior.
Changes:
- Azure Cosmos DB NoSQL: parameterize
delete()query instead of interpolatingref_doc_id. - Azure AI Search + Alibaba Cloud OpenSearch: escape single quotes in
ref_doc_idused in filter expressions for (a)sync deletes. - AnalyticDB: validate metadata keys and escape SQL string values when building filter clauses; add tests for escaping and unsafe keys.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| llama-index-integrations/vector_stores/llama-index-vector-stores-azurecosmosnosql/tests/test_azurecosmosnosql.py | Adds regression test ensuring delete() uses Cosmos parameterized queries. |
| llama-index-integrations/vector_stores/llama-index-vector-stores-azurecosmosnosql/llama_index/vector_stores/azurecosmosnosql/base.py | Switches delete() to use query parameters for ref_doc_id. |
| llama-index-integrations/vector_stores/llama-index-vector-stores-azureaisearch/tests/test_azureaisearch.py | Adds sync/async tests verifying ref_doc_id escaping in OData filters. |
| llama-index-integrations/vector_stores/llama-index-vector-stores-azureaisearch/llama_index/vector_stores/azureaisearch/base.py | Escapes ' in ref_doc_id before embedding into OData filter. |
| llama-index-integrations/vector_stores/llama-index-vector-stores-analyticdb/tests/test_analyticdb.py | Adds tests for SQL escaping and rejecting unsafe metadata keys. |
| llama-index-integrations/vector_stores/llama-index-vector-stores-analyticdb/llama_index/vector_stores/analyticdb/base.py | Adds metadata key validation + SQL string escaping in filter clause builder. |
| llama-index-integrations/vector_stores/llama-index-vector-stores-alibabacloud-opensearch/tests/test_vector_stores_alibabacloud_opensearch.py | Adds async delete test verifying ref_doc_id escaping in filter. |
| llama-index-integrations/vector_stores/llama-index-vector-stores-alibabacloud-opensearch/llama_index/vector_stores/alibabacloud_opensearch/base.py | Escapes ' in ref_doc_id before building OpenSearch filter. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| elif filter_.operator == FilterOperator.CONTAINS: | ||
| return ( | ||
| f"metadata_::jsonb->'{filter_.key}' {adb_operator} '[\"{filter_.value}\"]'" | ||
| f"metadata_::jsonb->'{key}' {adb_operator} '[\"{_escape_sql_str(filter_.value)}\"]'" | ||
| ) |
| def _validate_metadata_key(key: str) -> str: | ||
| # The key becomes part of the SQL/JSON-path text; allow a safe identifier charset only. | ||
| if not re.fullmatch(r"[A-Za-z0-9_]+", str(key)): | ||
| raise ValueError(f"Invalid metadata filter key: {key!r}") | ||
| return str(key) |
| query=( | ||
| "SELECT c.id, c.id AS partitionKey FROM c " | ||
| f"WHERE c.{self._metadata_key}.ref_doc_id = @ref_doc_id" | ||
| ), | ||
| parameters=[{"name": "@ref_doc_id", "value": ref_doc_id}], |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d269ce042d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if not re.fullmatch(r"[A-Za-z0-9_]+", str(key)): | ||
| raise ValueError(f"Invalid metadata filter key: {key!r}") |
There was a problem hiding this comment.
Preserve valid JSON metadata keys
When an existing metadata key contains a non-identifier character such as source.file, content-type, or a space, this new validation makes every AnalyticDB filtered query raise ValueError even though those keys are valid in BaseNode.metadata and are stored in the metadata_ JSON object. Since the key is already used as a quoted JSON key (metadata_->>'...'), escaping quotes in the key would harden the SQL without rejecting legitimate metadata schemas.
Useful? React with 👍 / 👎.
|
Updated AnalyticDB filter construction so metadata keys and values are escaped before being embedded in SQL filter fragments. This preserves valid metadata keys such as source.file, content-type, and keys with spaces while hardening generated query text. Focused vector store tests pass: 23 passed, 4 skipped. |
Summary
Tests
Result:
23 passed, 4 skipped.