fix: use Document.from_dict in InMemoryDocumentStore.load_from_disk#11594
Open
Ayushhgit wants to merge 1 commit into
Open
fix: use Document.from_dict in InMemoryDocumentStore.load_from_disk#11594Ayushhgit wants to merge 1 commit into
Ayushhgit wants to merge 1 commit into
Conversation
load_from_disk rebuilt documents with the plain Document constructor, which does not convert nested fields. Documents saved with a blob (ByteStream) or sparse_embedding (SparseEmbedding) came back with those fields as raw dicts, crashing repr(), to_dict(), equality comparison, save_to_disk of the reloaded store, and any component accessing document.blob.data. save_to_disk serializes with Document.to_dict(flatten=False); Document.from_dict is its inverse and restores the proper types. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
@Ayushhgit is attempting to deploy a commit to the deepset Team on Vercel. A member of the Team first needs to authorize it. |
Contributor
|
@Ayushhgit you currently have 3 open PRs and keep opening more. Please, focus on one PR at a time. |
Contributor
Author
|
Hey @davidsbatista these were my last, I'll wait until all current PR's of mine close until starting a new one. Sorry if I caused any inconvenience. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Related Issues
Proposed Changes:
InMemoryDocumentStore.load_from_diskrebuilt documents with the plainDocument(**doc)constructor, which performs no conversion of nested fields. Sincesave_to_diskserializes withDocument.to_dict(flatten=False)(convertingblobtoByteStream.to_dict()andsparse_embeddingtoSparseEmbedding.to_dict()), any document saved with those fields came back with raw dicts in their place. The corrupted documents crashedrepr(),to_dict(), equality comparison, a secondsave_to_disk, and any component accessingdocument.blob.data(e.g. image pipelines).One-line fix: reconstruct with
Document.from_dict(doc), the documented inverse ofto_dict, which restoresByteStreamandSparseEmbeddinginstances.How did you test it?
test_save_to_disk_and_load_from_disk_with_blob_and_sparse_embedding: saves a document with both abloband asparse_embedding, reloads, asserts proper types, equality with the original, and that the reloaded store can be saved again. Fails onmain, passes with this fix.hatch run test:unit test/document_stores/test_in_memory.py— 148 passed, 4 skipped.Notes for the reviewer
Document.from_dictalso handles the nestedmetadict produced byto_dict(flatten=False), so documents without blob/sparse fields round-trip exactly as before (covered by the existingtest_save_to_disk_and_load_from_disk).Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:and added!in case the PR includes breaking changes.🤖 Generated with Claude Code