fix: use Document.from_dict in InMemoryDocumentStore.load_from_disk #11594

Open
Ayushhgit wants to merge 1 commit into deepset-ai:main from Ayushhgit:fix-load-from-disk-document-from-dict

@Ayushhgit

Contributor

Related Issues

Proposed Changes:

InMemoryDocumentStore.load_from_disk rebuilt documents with the plain Document(**doc) constructor, which performs no conversion of nested fields. Since save_to_disk serializes with Document.to_dict(flatten=False) (converting blob to ByteStream.to_dict() and sparse_embedding to SparseEmbedding.to_dict()), any document saved with those fields came back with raw dicts in their place. The corrupted documents crashed repr(), to_dict(), equality comparison, a second save_to_disk, and any component accessing document.blob.data (e.g. image pipelines).

One-line fix: reconstruct with Document.from_dict(doc), the documented inverse of to_dict, which restores ByteStream and SparseEmbedding instances.
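
The difference between the two reconstruction paths can be illustrated with a minimal sketch. The `Blob` and `Doc` classes below are hypothetical stand-ins written for this example, not Haystack's actual `ByteStream` or `Document`; they only mimic the pattern of a dataclass whose `to_dict` serializes a nested object and whose `from_dict` rebuilds it.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Blob:
    """Hypothetical stand-in for a nested value type such as ByteStream."""
    data: bytes
    mime_type: Optional[str] = None

    def to_dict(self) -> dict[str, Any]:
        # bytes are not JSON-serializable, so store them as a list of ints
        return {"data": list(self.data), "mime_type": self.mime_type}

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "Blob":
        return cls(data=bytes(d["data"]), mime_type=d.get("mime_type"))


@dataclass
class Doc:
    """Hypothetical stand-in for a document with one nested field."""
    content: str
    blob: Optional[Blob] = None

    def to_dict(self) -> dict[str, Any]:
        return {
            "content": self.content,
            "blob": self.blob.to_dict() if self.blob else None,
        }

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "Doc":
        d = dict(d)  # copy so the caller's dict is not mutated
        if d.get("blob") is not None:
            d["blob"] = Blob.from_dict(d["blob"])
        return cls(**d)


original = Doc(content="hello", blob=Blob(data=b"\x89PNG", mime_type="image/png"))
serialized = original.to_dict()

broken = Doc(**serialized)            # plain constructor: nested field stays a dict
restored = Doc.from_dict(serialized)  # from_dict: nested field is rebuilt

print(type(broken.blob).__name__)    # dict
print(type(restored.blob).__name__)  # Blob
print(restored == original)          # True
```

In this sketch, `broken.blob.data` raises `AttributeError` because `broken.blob` is a plain dict, which is the same kind of failure the description attributes to components accessing document.blob.data.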

How did you test it?

  • New regression test test_save_to_disk_and_load_from_disk_with_blob_and_sparse_embedding: saves a document with both a blob and a sparse_embedding, reloads, asserts proper types, equality with the original, and that the reloaded store can be saved again. Fails on main, passes with this fix.
  • hatch run test:unit test/document_stores/test_in_memory.py — 148 passed, 4 skipped.
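
The shape of the regression test described above (save, reload, check types, check equality, save again) can be sketched as follows. This is not the test from the PR: `Sparse`, `Doc`, `Store`, and `roundtrip_check` are hypothetical stand-ins that use only the standard library, and the JSON layout is an assumption made for the example.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class Sparse:
    """Hypothetical stand-in for a sparse-embedding value type."""
    indices: list[int]
    values: list[float]


@dataclass
class Doc:
    """Hypothetical stand-in for a document carrying a nested field."""
    id: str
    sparse: Sparse

    def to_dict(self) -> dict[str, Any]:
        nested = {"indices": self.sparse.indices, "values": self.sparse.values}
        return {"id": self.id, "sparse": nested}

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "Doc":
        return cls(id=d["id"], sparse=Sparse(**d["sparse"]))


class Store:
    """Hypothetical in-memory store that persists documents as JSON."""

    def __init__(self) -> None:
        self.docs: dict[str, Doc] = {}

    def save_to_disk(self, path: str) -> None:
        payload = {"documents": [doc.to_dict() for doc in self.docs.values()]}
        Path(path).write_text(json.dumps(payload))

    @classmethod
    def load_from_disk(cls, path: str) -> "Store":
        payload = json.loads(Path(path).read_text())
        store = cls()
        for raw in payload["documents"]:
            doc = Doc.from_dict(raw)  # rebuild nested types, not Doc(**raw)
            store.docs[doc.id] = doc
        return store


def roundtrip_check() -> bool:
    original = Doc(id="1", sparse=Sparse(indices=[0, 3], values=[0.5, 0.25]))
    store = Store()
    store.docs[original.id] = original
    with tempfile.TemporaryDirectory() as tmp:
        first = str(Path(tmp) / "store.json")
        second = str(Path(tmp) / "store_again.json")
        store.save_to_disk(first)
        loaded = Store.load_from_disk(first)
        assert isinstance(loaded.docs["1"].sparse, Sparse)  # proper type restored
        assert loaded.docs["1"] == original                 # equal to the original
        loaded.save_to_disk(second)                         # reloaded store saves again
        # both files should hold identical serialized content
        return Path(first).read_text() == Path(second).read_text()
```

Saving the reloaded store a second time is the step most likely to expose a broken reconstruction, since serialization calls `to_dict` on every document and fails if a nested field is still a raw dict.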

Notes for the reviewer

  • Document.from_dict also handles the nested meta dict produced by to_dict(flatten=False), so documents without blob/sparse fields round-trip exactly as before (covered by the existing test_save_to_disk_and_load_from_disk).

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • I have added a release note file, following the contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

🤖 Generated with Claude Code

load_from_disk rebuilt documents with the plain Document constructor,
which does not convert nested fields. Documents saved with a blob
(ByteStream) or sparse_embedding (SparseEmbedding) came back with those
fields as raw dicts, crashing repr(), to_dict(), equality comparison,
save_to_disk of the reloaded store, and any component accessing
document.blob.data.

save_to_disk serializes with Document.to_dict(flatten=False);
Document.from_dict is its inverse and restores the proper types.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Ayushhgit Ayushhgit requested a review from a team as a code owner June 12, 2026 08:08
@Ayushhgit Ayushhgit requested review from davidsbatista and removed request for a team June 12, 2026 08:08
@vercel

vercel Bot commented Jun 12, 2026


@Ayushhgit is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@davidsbatista

Contributor

@Ayushhgit you currently have 3 open PRs and keep opening more. Please focus on one PR at a time.

@Ayushhgit

Contributor Author

Hey @davidsbatista, these were my last. I'll wait until all of my current PRs are closed before starting a new one. Sorry if I caused any inconvenience.

Development

Successfully merging this pull request may close these issues.

fix: InMemoryDocumentStore.load_from_disk corrupts documents with blob or sparse_embedding (loaded as raw dicts)
