[Bug]: Possible mismatch between texts and text_summaries in multimodal_rag_langchain.ipynb #2550

@FisicoEn

File Name

gemini/use-cases/retrieval-augmented-generation/multimodal_rag_langchain.ipynb

What happened?

In the notebook gemini/use-cases/retrieval-augmented-generation/multimodal_rag_langchain.ipynb, there seems to be a mismatch between the IDs assigned to the original PDF chunks and the IDs used for the corresponding summaries.

The code defines:
doc_contents = texts + tables + img_base64_list

doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries + table_summaries + image_summaries)
]

However, `texts` (from `partition_pdf`) and `text_summaries` (produced after an additional split via `text_splitter.split_text`) do not necessarily contain the same number of elements.
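To make the concern concrete, here is a minimal sketch with toy data. It assumes the text was re-split before summarization, so there are more text summaries than original text chunks. The list contents are hypothetical, and the bounds guard is added only so the sketch runs to completion; plain `doc_ids[i]` would raise `IndexError` once `i` reaches `len(doc_ids)`.

```python
import uuid

# Hypothetical toy data: 2 original text chunks, but 3 text summaries
# because the text was re-split before summarizing.
texts = ["text chunk A", "text chunk B"]
tables = ["table 1"]
text_summaries = ["summary A1", "summary A2", "summary B"]
table_summaries = ["summary of table 1"]

doc_contents = texts + tables
doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
summaries = text_summaries + table_summaries

# Map each ID back to the content it was generated for.
id_to_content = dict(zip(doc_ids, doc_contents))

# Same positional indexing as in the snippet above, with a bounds guard
# (an addition for this sketch) so the last summary does not raise.
paired = [
    (s, id_to_content[doc_ids[i]] if i < len(doc_ids) else None)
    for i, s in enumerate(summaries)
]

for summary, content in paired:
    print(f"{summary!r} -> {content!r}")
```

In this toy case the third text summary ends up linked to the table, and the table summary has no ID left to point to.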

Relevant log output

Because `doc_ids` is generated from `doc_contents` and then indexed with `i` from `enumerate(text_summaries + table_summaries + image_summaries)`, any change in the number of chunks (e.g., due to re‑chunking with `text_splitter.split_text`) can shift the indices and cause summaries to be associated with the wrong original content.
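One possible way to avoid the positional coupling is to assign each ID at the moment a content item is paired with its summary, and to fail fast if the two lists differ in length. The sketch below is only an illustration, not the notebook's code: `pair_with_ids` is a hypothetical helper, plain dicts stand in for `Document`, and `"doc_id"` is an assumed value for `id_key`.

```python
import uuid


def pair_with_ids(contents, summaries, label):
    """Assign one fresh ID per (content, summary) pair.

    Hypothetical helper: raises ValueError if the lists differ in length,
    instead of silently shifting indices.
    """
    if len(contents) != len(summaries):
        raise ValueError(
            f"{label}: {len(contents)} contents but {len(summaries)} summaries"
        )
    return [(str(uuid.uuid4()), c, s) for c, s in zip(contents, summaries)]


# Hypothetical toy data, one summary per content item.
texts = ["text chunk A", "text chunk B"]
text_summaries = ["summary A", "summary B"]
tables = ["table 1"]
table_summaries = ["summary of table 1"]

triples = (
    pair_with_ids(texts, text_summaries, "texts")
    + pair_with_ids(tables, table_summaries, "tables")
)

id_key = "doc_id"  # assumed key name for this sketch
doc_ids = [doc_id for doc_id, _, _ in triples]
doc_contents = [content for _, content, _ in triples]
# Plain dicts stand in for Document(page_content=..., metadata=...).
summary_records = [
    {"page_content": s, "metadata": {id_key: doc_id}} for doc_id, _, s in triples
]
```

Checking each modality separately means a length mismatch in the text lists is reported as such, rather than surfacing later as a table or image summary attached to the wrong content.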

Code of Conduct

  • I agree to follow this project's Code of Conduct
