fix: validate Weaviate collection existence before trusting Redis cache #33376
majiayu000 wants to merge 4 commits into langgenius:main
When the Redis cache key `vector_indexing_{collection_name}` exists but
the Weaviate class has been deleted externally (container restart,
manual cleanup), `_create_collection()` skips schema creation. This
causes Weaviate auto-schema to infer wrong property types (uuid instead
of text), resulting in silent RAG retrieval failure with zero vectors.
Add a collection existence check when Redis cache hits. If the class no
longer exists, clear the stale cache and proceed with proper schema
creation.
Fixes langgenius#32458
Signed-off-by: majiayu000 <1835304752@qq.com>
Summary of Changes (Gemini Code Assist)
This pull request addresses a critical issue where a stale Redis cache entry could lead to incorrect Weaviate schema creation if the corresponding collection was externally deleted. The fix introduces a validation step to confirm the Weaviate collection's existence when the cache indicates it's present. If the collection is not found, the stale cache is invalidated, preventing silent data retrieval failures caused by schema mismatches and ensuring data consistency.
Pyrefly Diff (base → PR)
--- /tmp/pyrefly_base.txt 2026-03-12 16:12:03.267220963 +0000
+++ /tmp/pyrefly_pr.txt 2026-03-12 16:11:54.569185147 +0000
@@ -385,7 +385,7 @@
ERROR Object of class `list` has no attribute `fields` [missing-attribute]
--> core/rag/datasource/vdb/vikingdb/vikingdb_vector.py:143:55
ERROR Class member `WeaviateVector._get_uuids` overrides parent class `BaseVector` in an inconsistent manner [bad-param-name-override]
- --> core/rag/datasource/vdb/weaviate/weaviate_vector.py:237:9
+ --> core/rag/datasource/vdb/weaviate/weaviate_vector.py:243:9
ERROR `response` may be uninitialized [unbound-name]
--> core/rag/extractor/firecrawl/firecrawl_app.py:134:16
ERROR `response` may be uninitialized [unbound-name]
Code Review
This pull request addresses a critical bug where a stale Redis cache could lead to silent data indexing failures in Weaviate. The proposed solution, which validates the existence of a Weaviate collection even on a cache hit, is sound and directly tackles the root cause. My review includes one suggestion to improve efficiency by avoiding a redundant check that is introduced in the new logic for handling stale cache entries.
    if self._client.collections.exists(self._collection_name):
        return
    redis_client.delete(cache_key)
    logger.warning(
        "Stale Redis cache for collection %s: class deleted externally, recreating",
        self._collection_name,
    )
This change correctly handles stale cache entries. However, in the case of a stale cache (when `self._client.collections.exists` returns `False`), the code proceeds to the `try` block below, where `self._client.collections.exists()` is called again on line 189. This results in a redundant network call.
Since this happens while holding a Redis lock, it unnecessarily increases the duration the lock is held and could increase lock contention. Consider refactoring to avoid this second check, for instance, by storing the result of the first `exists()` call in a variable and reusing it.
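The suggestion above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: `FakeCollections`, `create_collection`, the plain `dict` standing in for Redis, and `threading.Lock` standing in for the Redis lock are all hypothetical. Only the cache key format `vector_indexing_{collection_name}` and the `exists()` check come from the PR.

```python
import logging
import threading

logger = logging.getLogger(__name__)


class FakeCollections:
    """Hypothetical stand-in for `client.collections`; counts exists() calls."""

    def __init__(self, names=()):
        self.names = set(names)
        self.exists_calls = 0

    def exists(self, name):
        self.exists_calls += 1
        return name in self.names

    def create(self, name):
        self.names.add(name)


def create_collection(collections, cache, lock, name):
    """Ensure `name` exists, calling exists() once. Returns True if it was created."""
    cache_key = f"vector_indexing_{name}"
    with lock:
        # One round trip; the result is reused on every path below.
        exists = collections.exists(name)
        if cache.get(cache_key):
            if exists:
                return False  # cache entry is valid, nothing to do
            # Stale cache: the collection was deleted externally.
            cache.pop(cache_key, None)
            logger.warning("Stale cache for collection %s, recreating", name)
        created = False
        if not exists:
            collections.create(name)
            created = True
        cache[cache_key] = 1
        return created
```

Because the `exists()` result is stored in a local variable, the stale-cache path goes straight to creation without a second lookup while the lock is held.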
Signed-off-by: majiayu000 <1835304752@qq.com>
Cache the `exists()` result to avoid calling it twice when the Redis cache is stale.
On the stale-cache path we already know the collection does not exist, so skip the second check and proceed directly to creation. This reduces lock hold time.
Signed-off-by: majiayu000 <1835304752@qq.com>
Pyrefly Diff (base → PR)
--- /tmp/pyrefly_base.txt 2026-03-13 01:59:05.751075287 +0000
+++ /tmp/pyrefly_pr.txt 2026-03-13 01:58:57.163994430 +0000
@@ -385,7 +385,7 @@
ERROR Object of class `list` has no attribute `fields` [missing-attribute]
--> core/rag/datasource/vdb/vikingdb/vikingdb_vector.py:143:55
ERROR Class member `WeaviateVector._get_uuids` overrides parent class `BaseVector` in an inconsistent manner [bad-param-name-override]
- --> core/rag/datasource/vdb/weaviate/weaviate_vector.py:237:9
+ --> core/rag/datasource/vdb/weaviate/weaviate_vector.py:249:9
ERROR `response` may be uninitialized [unbound-name]
--> core/rag/extractor/firecrawl/firecrawl_app.py:134:16
ERROR `response` may be uninitialized [unbound-name]
Summary
When the Redis cache key `vector_indexing_{collection_name}` is still alive but the Weaviate class has been deleted externally (container restart, manual cleanup, ephemeral storage), `_create_collection()` trusts the stale cache and skips schema creation. This causes Weaviate's auto-schema to infer wrong property types (`uuid` instead of `text` for `doc_id`, `document_id`), resulting in silent RAG retrieval failure: all documents show as "completed" but return 0 search results because vectors are silently dropped.

This PR adds a collection existence check when the Redis cache hits. If the Weaviate class no longer exists, the stale cache key is cleared and schema creation proceeds normally with the correct property types.
Fixes #32458
Changes
- `self._client.collections.exists()` validation in `_create_collection()` when the Redis cache indicates the collection was already indexed

Test plan
- `tests/integration_tests/vdb/weaviate/test_weaviate.py` covers the Weaviate vector store operations
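A unit-level check of the stale-cache path could look roughly like the sketch below. It is an assumption-laden illustration, not part of this PR or the existing test file: `ensure_collection` is a hypothetical, simplified stand-in for the cache-hit branch of `_create_collection()` (no locking, no schema properties), `make_mocks` is a hypothetical helper, and both clients are `unittest.mock.MagicMock` objects rather than real Redis or Weaviate clients.

```python
from unittest.mock import MagicMock


def ensure_collection(client, redis_client, name):
    """Hypothetical, simplified model of the cache check (no lock, no schema)."""
    cache_key = f"vector_indexing_{name}"
    if redis_client.get(cache_key):
        if client.collections.exists(name):
            return  # cache entry is valid
        # Stale entry: the collection is gone, so drop the key and recreate.
        redis_client.delete(cache_key)
    client.collections.create(name=name)
    redis_client.set(cache_key, 1, ex=3600)


def make_mocks(cached, exists):
    """Build mock clients with a preset cache state and collection state."""
    client, redis_client = MagicMock(), MagicMock()
    redis_client.get.return_value = 1 if cached else None
    client.collections.exists.return_value = exists
    return client, redis_client


def test_valid_cache_skips_creation():
    client, redis_client = make_mocks(cached=True, exists=True)
    ensure_collection(client, redis_client, "demo")
    client.collections.create.assert_not_called()
    redis_client.delete.assert_not_called()


def test_stale_cache_is_cleared_and_collection_recreated():
    client, redis_client = make_mocks(cached=True, exists=False)
    ensure_collection(client, redis_client, "demo")
    redis_client.delete.assert_called_once_with("vector_indexing_demo")
    client.collections.create.assert_called_once_with(name="demo")
```

Mocking both clients keeps the test independent of a running Weaviate or Redis instance; it exercises only the branching logic, not the schema types that the integration test would cover.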