fix: add doc_type to Weaviate properties and default Vector attributes #33398

Open
RickDamon wants to merge 2 commits into langgenius:main from RickDamon:fix/weaviate-doc-type-metadata

Conversation

@RickDamon

Fixes #33388. Weaviate's search_by_vector only returns properties listed in return_properties. The doc_type field was missing from both the default attributes list and Weaviate's schema definition, causing multimodal image retrieval to fail on Docker deployments (which use Weaviate by default).

Changes:

  • Add doc_type to Vector class default attributes list
  • Add doc_type property in Weaviate _create_collection schema
  • Add doc_type check in Weaviate _ensure_properties for existing collections

Important

  1. Make sure you have read our contribution guidelines
  2. Ensure there is an associated issue and you have been assigned to it
  3. Use the correct syntax to link this PR: Fixes #<issue number>.

Summary

Fixes #33388

Problem

When a multimodal embedding model (e.g., Tongyi multimodal-embedding-v1) is used to index documents containing images, image retrieval returns empty results on Docker deployments (which use Weaviate as the default vector database).

Root Cause: The search_by_vector and search_by_full_text methods of the Weaviate integration only return properties explicitly listed in the return_properties parameter. The doc_type field was missing from:

  1. The default attributes list in the Vector class (vector_factory.py)
  2. Weaviate's collection schema definition (_create_collection and _ensure_properties)
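The effect of an explicit return_properties list can be illustrated with a minimal sketch. Everything in it is a stand-in: `project_properties`, the sample object, and both attribute lists are hypothetical and only mimic the behavior described above. It is not the code in vector_factory.py or the Weaviate client.

```python
from typing import Any

# Hypothetical stored object; the field names follow the PR description.
STORED_OBJECT: dict[str, Any] = {
    "text": "a photo of a cat",
    "document_id": "doc-1",
    "doc_id": "node-1",
    "doc_type": "image",
}


def project_properties(stored: dict[str, Any], return_properties: list[str]) -> dict[str, Any]:
    """Mimic a search that returns only the explicitly requested properties."""
    return {name: stored[name] for name in return_properties if name in stored}


# Illustrative attribute lists, not the exact lists used by the Vector class.
attributes_before = ["doc_id", "document_id"]
attributes_after = attributes_before + ["doc_type"]

before = project_properties(STORED_OBJECT, attributes_before)
after = project_properties(STORED_OBJECT, attributes_after)

# The value is stored in both cases, but only the second query returns it.
print(before.get("doc_type"), after.get("doc_type"))
```

The stored object is identical in both calls. Only the requested property list decides whether doc_type reaches the caller.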

This caused doc_type to be None in search results, so image documents were incorrectly routed to index_node_ids (which queries DocumentSegment by segment index node ID) instead of image_doc_ids (which queries UploadFile), leading to empty retrieval results.
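A rough sketch of the routing described above, under stated assumptions: the function name `route_hits`, the metadata layout, and the literal value "image" are hypothetical and chosen for illustration. Only the two bucket names, index_node_ids and image_doc_ids, come from the description.

```python
from typing import Any, Optional


def route_hits(hits: list[dict[str, Optional[Any]]]) -> tuple[list[str], list[str]]:
    """Split search hits into text segments and image documents by doc_type."""
    index_node_ids: list[str] = []  # looked up as DocumentSegment index node IDs
    image_doc_ids: list[str] = []   # looked up as UploadFile IDs
    for metadata in hits:
        # "image" is an assumed marker value for image documents.
        if metadata.get("doc_type") == "image":
            image_doc_ids.append(metadata["doc_id"])
        else:
            index_node_ids.append(metadata["doc_id"])
    return index_node_ids, image_doc_ids


# Same image hit, with and without doc_type in the returned metadata.
with_doc_type = route_hits([{"doc_id": "img-1", "doc_type": "image"}])
without_doc_type = route_hits([{"doc_id": "img-1", "doc_type": None}])
```

When doc_type comes back as None, the image hit falls through to the segment bucket, which matches the empty-result symptom reported in the issue.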

Note: This issue only affects Weaviate users. Other vector databases (Qdrant, Milvus, PgVector, Chroma, Elasticsearch, etc.) return all stored metadata by default, so doc_type is preserved in their search results.

Changes

  • Add "doc_type" to Vector class default attributes list, so Weaviate includes it in return_properties during search
  • Add doc_type (TEXT type) property definition in Weaviate _create_collection() schema for new collections
  • Add doc_type check in Weaviate _ensure_properties() to automatically add the property to existing collections
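The migration step for existing collections can be sketched as an idempotent "add if missing" check. `FakeCollection`, `ensure_properties`, and `REQUIRED_TEXT_PROPERTIES` are hypothetical names for this illustration. The real _ensure_properties talks to the Weaviate client, which this sketch does not attempt to reproduce.

```python
class FakeCollection:
    """Hypothetical in-memory stand-in for an existing collection's schema."""

    def __init__(self, property_names: list[str]) -> None:
        self.properties: dict[str, str] = {name: "text" for name in property_names}

    def add_property(self, name: str, data_type: str) -> None:
        self.properties[name] = data_type


# Assumed list of properties the schema should contain; doc_type is the new one.
REQUIRED_TEXT_PROPERTIES = ["document_id", "doc_id", "doc_type"]


def ensure_properties(collection: FakeCollection) -> list[str]:
    """Add any missing required property and return the names that were added."""
    added: list[str] = []
    for name in REQUIRED_TEXT_PROPERTIES:
        if name not in collection.properties:
            collection.add_property(name, "text")
            added.append(name)
    return added


# A collection created before the fix lacks doc_type.
legacy = FakeCollection(["document_id", "doc_id"])
first_run = ensure_properties(legacy)   # adds the missing property
second_run = ensure_properties(legacy)  # nothing left to add
```

Checking before adding keeps the step safe to run on every startup, for both new and previously created collections.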

Screenshots

Not applicable (backend-only fix, no UI changes).

Checklist

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran make lint and make type-check (backend) and cd web && npx lint-staged (frontend) to appease the lint gods

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Mar 13, 2026
@github-actions github-actions bot added the needs-revision for anti-slop label Mar 13, 2026
@dosubot dosubot bot added the 👻 feat:rag Embedding related issue, like qdrant, weaviate, milvus, vector database. label Mar 13, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where multimodal image retrieval failed on Docker deployments utilizing Weaviate as the vector database. The problem stemmed from Weaviate's search methods not returning the doc_type field, leading to incorrect routing of image documents. The fix ensures that doc_type is consistently managed within Weaviate's schema and included in search results, thereby restoring proper multimodal retrieval functionality.

Highlights

  • Weaviate doc_type attribute handling: The doc_type field has been added to the default attributes list for the Vector class, ensuring it is consistently included in Weaviate operations.
  • Weaviate schema definition update: The doc_type property is now explicitly defined as a TEXT type in Weaviate's _create_collection schema for new collections.
  • Existing Weaviate collection property migration: A mechanism has been implemented in _ensure_properties to automatically add the doc_type property to existing Weaviate collections if it is missing.
  • Comprehensive unit tests: New unit tests have been added to verify the correct handling of doc_type across Weaviate collection creation, property migration, and search operations (vector and full-text).

Changelog
  • api/core/rag/datasource/vdb/vector_factory.py
    • Added 'doc_type' to the default list of attributes for the Vector class.
  • api/core/rag/datasource/vdb/weaviate/weaviate_vector.py
    • Included 'doc_type' as a TEXT property in the _create_collection method for new Weaviate schemas.
    • Implemented a check in _ensure_properties to add 'doc_type' to existing Weaviate collections if it's not present.
  • api/tests/unit_tests/core/rag/datasource/vdb/weaviate/test_weaviate_vector.py
    • Added new unit tests to validate doc_type inclusion in Vector class initialization.
    • Included tests to confirm doc_type is part of the schema when creating new Weaviate collections.
    • Added tests to verify _ensure_properties correctly adds missing doc_type to existing collections.
    • Implemented tests to ensure search_by_vector and search_by_full_text return doc_type in document metadata.
    • Added a test to confirm add_texts stores doc_type in Weaviate object properties.
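One way such a search test could be shaped is shown below, as a sketch rather than the tests added in this PR. The wrapper `search_by_vector` here is a hypothetical free function, and the `collection.query.near_vector(..., return_properties=...)` call shape is an assumption about the client. Because the collection is a MagicMock, the sketch only checks what the wrapper forwards, not the client API itself.

```python
from unittest.mock import MagicMock


def search_by_vector(collection, query_vector, attributes):
    """Hypothetical wrapper that forwards the attribute list as return_properties."""
    return collection.query.near_vector(
        near_vector=query_vector,
        return_properties=attributes,
    )


# The mock records the call so the forwarded keyword arguments can be inspected.
collection = MagicMock()
search_by_vector(collection, [0.1, 0.2, 0.3], ["doc_id", "doc_type"])
call_kwargs = collection.query.near_vector.call_args.kwargs

assert "doc_type" in call_kwargs["return_properties"]
```

Asserting on the forwarded arguments keeps the test independent of a running Weaviate instance.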

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively resolves a bug where doc_type was missing from Weaviate search results, causing failures in multimodal image retrieval. The changes are well-implemented by updating the default vector attributes and the Weaviate collection schema to include doc_type. The addition of comprehensive unit tests is a great way to ensure the fix is robust and prevents future regressions. I have one minor suggestion to improve the new test code's clarity.

Fixes langgenius#33388. Weaviate's search_by_vector only returns properties listed
in return_properties. The doc_type field was missing from both the default
attributes list and Weaviate's schema definition, causing multimodal
image retrieval to fail on Docker deployments (which use Weaviate by default).

Changes:
- Add doc_type to Vector class default attributes list
- Add doc_type property in Weaviate _create_collection schema
- Add doc_type check in Weaviate _ensure_properties for existing collections
- Add unit tests for doc_type handling in Weaviate vector operations
@RickDamon RickDamon force-pushed the fix/weaviate-doc-type-metadata branch from 154b7fe to 59961a3 Compare March 13, 2026 08:13
@github-actions
Contributor

Pyrefly Diff

base → PR
--- /tmp/pyrefly_base.txt	2026-03-13 12:33:05.424910580 +0000
+++ /tmp/pyrefly_pr.txt	2026-03-13 12:32:56.247946792 +0000
@@ -385,7 +385,7 @@
 ERROR Object of class `list` has no attribute `fields` [missing-attribute]
    --> core/rag/datasource/vdb/vikingdb/vikingdb_vector.py:143:55
 ERROR Class member `WeaviateVector._get_uuids` overrides parent class `BaseVector` in an inconsistent manner [bad-param-name-override]
-   --> core/rag/datasource/vdb/weaviate/weaviate_vector.py:237:9
+   --> core/rag/datasource/vdb/weaviate/weaviate_vector.py:240:9
 ERROR `response` may be uninitialized [unbound-name]
    --> core/rag/extractor/firecrawl/firecrawl_app.py:134:16
 ERROR `response` may be uninitialized [unbound-name]

Development

Successfully merging this pull request may close these issues.

Local deployment fails to retrieve images in dataset