Hi, thanks for the amazing work.
I'm using Agentless to do bug localization on my own data. I encountered the following error when retrieving documents in Index.py:
```
File "/home/x/agentless/fl/retrieve.py", line 86, in retrieve_locs
    file_names, meta_infos, traj = retriever.retrieve(mock=args.mock)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/x/agentless/fl/Index.py", line 270, in retrieve
    index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
File "/home/x/miniconda3/envs/sec/lib/python3.11/site-packages/llama_index/core/node_parser/text/sentence.py", line 160, in split_text_metadata_aware
    raise ValueError(
ValueError: Metadata length (734) is longer than chunk size (512). Consider increasing the chunk size or decreasing the size of your metadata to avoid this.
```
It turns out that, in check_meta_data, when using doc.get_content(metadata_mode=MetadataMode.EMBED) to get the formatted metadata string after setting text='', get_content just returns an empty string, so check_meta_data always returns False instead of filtering out documents with long metadata.
The fix is quite simple: just replace doc.get_content(metadata_mode=MetadataMode.EMBED) with doc.get_metadata_str(mode=MetadataMode.EMBED). This way, the returned metadata string is complete even when the document's text is empty, and check_meta_data can operate normally.
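To make the issue concrete, here is a minimal, self-contained sketch that reproduces the reported behavior with a mock `Document` (it does not import llama-index; the mock mirrors what the report describes, and `check_meta_data` is a hypothetical reconstruction of the helper in Agentless's Index.py, not its actual code):

```python
from enum import Enum


class MetadataMode(Enum):
    EMBED = "embed"


class Document:
    """Mock mimicking the reported llama-index behavior for text=''."""

    def __init__(self, text, metadata):
        self.text = text
        self.metadata = metadata

    def get_metadata_str(self, mode=MetadataMode.EMBED):
        # The metadata string is built from metadata alone, so it is
        # complete regardless of whether text is empty.
        return "\n".join(f"{k}: {v}" for k, v in self.metadata.items())

    def get_content(self, metadata_mode=MetadataMode.EMBED):
        # As reported in the issue: with text='', the formatted content
        # comes back empty, hiding the metadata entirely.
        if not self.text:
            return ""
        return self.get_metadata_str(metadata_mode) + "\n\n" + self.text


def check_meta_data(doc, chunk_size=512):
    """Return True if the doc's metadata fits within chunk_size.

    The buggy version measured doc.get_content(...), which is '' for
    empty-text docs, so oversized metadata was never filtered out.
    """
    metadata_str = doc.get_metadata_str(mode=MetadataMode.EMBED)
    return len(metadata_str) <= chunk_size


# With text='', get_content hides the metadata but get_metadata_str does not:
doc = Document(text="", metadata={"file_path": "x" * 700})
print(doc.get_content(metadata_mode=MetadataMode.EMBED))  # empty string
print(check_meta_data(doc))  # False: oversized metadata is now caught
```

With the `get_metadata_str`-based check, documents whose metadata alone exceeds the splitter's chunk size are filtered before `VectorStoreIndex.from_documents` ever reaches `split_text_metadata_aware`, which is what raises the ValueError above.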
I'm using llama-index 0.14.4. Has anyone encountered this issue before?