Open
Description
File Name
gemini/use-cases/retrieval-augmented-generation/multimodal_rag_langchain.ipynb
What happened?
In the notebook gemini/use-cases/retrieval-augmented-generation/multimodal_rag_langchain.ipynb, there seems to be a mismatch between the IDs assigned to the original PDF chunks and the IDs used for the corresponding summaries.
The code defines:
doc_contents = texts + tables + img_base64_list
doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries + table_summaries + image_summaries)
]
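However, `texts` (from `partition_pdf`) and `text_summaries` (produced after an additional split via `text_splitter.split_text`) do not necessarily have the same length, so index `i` in the summaries list need not refer to the same item as index `i` in `doc_contents`.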
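To make the mismatch concrete, here is a minimal, dependency-free sketch. It is not the notebook's code: the sample strings, the `"doc_id"` value for `id_key`, the `split_texts` list, and the `link_summaries` helper are all hypothetical, and plain dicts stand in for LangChain `Document` objects. It assumes the re-split produces fewer text chunks than `partition_pdf` did, which is one way the lengths can diverge.

```python
import uuid

id_key = "doc_id"  # assumed value; the issue only shows the name `id_key`

# Hypothetical stand-in data: three partitioned text chunks that were merged
# and re-split into two chunks before summarization.
texts = ["text 0", "text 1", "text 2"]
tables = ["table 0"]
img_base64_list = ["image 0"]

split_texts = ["text 0 text 1", "text 2"]  # what was actually summarized
text_summaries = ["summary of text 0 text 1", "summary of text 2"]
table_summaries = ["summary of table 0"]
image_summaries = ["summary of image 0"]

# Pattern from the issue: IDs come from one list, indices from another.
doc_contents = texts + tables + img_base64_list
doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
content_by_id = dict(zip(doc_ids, doc_contents))

all_summaries = text_summaries + table_summaries + image_summaries
naive_pairs = [
    {"page_content": s, "metadata": {id_key: doc_ids[i]}}
    for i, s in enumerate(all_summaries)
]

# The table summary sits at index 2, but doc_ids[2] belongs to "text 2".
table_summary = naive_pairs[2]
mislinked = content_by_id[table_summary["metadata"][id_key]]
print(table_summary["page_content"], "->", mislinked)
# prints: summary of table 0 -> text 2


def link_summaries(contents, summaries):
    """Pair each summary with the content it was derived from, or fail loudly."""
    if len(contents) != len(summaries):
        raise ValueError(
            f"{len(contents)} contents but {len(summaries)} summaries"
        )
    ids = [str(uuid.uuid4()) for _ in contents]
    summary_docs = [
        {"page_content": s, "metadata": {id_key: doc_id}}
        for doc_id, s in zip(ids, summaries)
    ]
    return summary_docs, dict(zip(ids, contents))


# Store the chunks that were actually summarized, so both lists line up.
summary_docs, store = link_summaries(
    split_texts + tables + img_base64_list, all_summaries
)
```

In this sketch the misalignment is silent because there are fewer summaries than contents. If the re-split instead produced more summaries than contents, `doc_ids[i]` would raise an `IndexError`. One possible remedy, shown by the hypothetical `link_summaries`, is to generate the IDs from the same chunks that were summarized and to check the lengths before pairing.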
Relevant log output
Because `doc_ids` is generated from `doc_contents` and then indexed with `i` from `enumerate(text_summaries + table_summaries + image_summaries)`, any change in the number of chunks (e.g., due to re-chunking with `text_splitter.split_text`) can shift the indices and cause summaries to be associated with the wrong original content.
Code of Conduct
- I agree to follow this project's Code of Conduct