[BUG] Misuse of doc.get_content() in Index.py #83

@CDHZAYN

Hi, thanks for the amazing work.

I'm using Agentless to do bug localization on my own data. I encountered the following error when retrieving documents in Index.py:

  File "/home/x/agentless/fl/retrieve.py", line 86, in retrieve_locs
    file_names, meta_infos, traj = retriever.retrieve(mock=args.mock)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/agentless/fl/Index.py", line 270, in retrieve
    index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
  File "/home/x/miniconda3/envs/sec/lib/python3.11/site-packages/llama_index/core/node_parser/text/sentence.py", line 160, in split_text_metadata_aware
    raise ValueError(
ValueError: Metadata length (734) is longer than chunk size (512). Consider increasing the chunk size or decreasing the size of your metadata to avoid this.

It turns out that check_meta_data builds a document with text='' and then calls doc.get_content(metadata_mode=MetadataMode.EMBED) to get the formatted metadata string. With empty text, get_content just returns an empty string, so check_meta_data always returns False instead of filtering out documents whose metadata is too long.

The fix is quite simple: replace doc.get_content(metadata_mode=MetadataMode.EMBED) with doc.get_metadata_str(mode=MetadataMode.EMBED). The returned metadata string is then complete even when the document's text is empty, and check_meta_data can operate normally.
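To make the length check concrete, here is a minimal sketch of the comparison such a filter needs to perform. It is not the Agentless implementation: `metadata_fits`, `whitespace_token_count`, and `CHUNK_SIZE` are hypothetical names, and whitespace splitting is only a stand-in for whatever tokenizer the splitter actually uses. The value 512 is taken from the error message above.

```python
from typing import Callable

CHUNK_SIZE = 512  # chunk size reported in the ValueError above


def whitespace_token_count(text: str) -> int:
    """Crude token counter (an assumption); the real splitter uses its own tokenizer."""
    return len(text.split())


def metadata_fits(
    metadata_str: str,
    chunk_size: int = CHUNK_SIZE,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> bool:
    """Return True when the metadata string is strictly shorter than one chunk."""
    return count_tokens(metadata_str) < chunk_size


# An empty string counts as zero tokens, so a check that is fed an empty
# string can never detect oversized metadata.
empty_passes = metadata_fits("")

# 734 tokens of metadata (the length from the traceback) do not fit in 512.
oversized_passes = metadata_fits(" ".join(["tok"] * 734))
```

In the setting described above, the string passed to such a check would come from doc.get_metadata_str(mode=MetadataMode.EMBED) rather than from get_content, so that the check sees the full metadata even though the document text is empty.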

I'm using llama-index 0.14.4. Has anyone encountered this issue before?
