[BUG] Misuse of doc.get_content() in Index.py #83

Hi, thanks for the amazing work.

I'm using Agentless for bug localization on my own data. I hit the following error when retrieving documents in `Index.py`:

```
File "/home/x/agentless/fl/retrieve.py", line 86, in retrieve_locs
    file_names, meta_infos, traj = retriever.retrieve(mock=args.mock)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/agentless/fl/Index.py", line 270, in retrieve
    index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
File "/home/x/miniconda3/envs/sec/lib/python3.11/site-packages/llama_index/core/node_parser/text/sentence.py", line 160, in split_text_metadata_aware
    raise ValueError(
ValueError: Metadata length (734) is longer than chunk size (512). Consider increasing the chunk size or decreasing the size of your metadata to avoid this.
```

It turns out that `check_meta_data` sets `text=''` and then calls [doc.get_content(metadata_mode=MetadataMode.EMBED)](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/schema.py#L642) to get the formatted metadata string. With the text empty, `get_content` just returns an empty string, so `check_meta_data` always returns False instead of filtering out documents whose metadata is too long.

The fix is quite simple: replace `doc.get_content(metadata_mode=MetadataMode.EMBED)` with `doc.get_metadata_str(mode=MetadataMode.EMBED)`. This way the full metadata string is returned even when the document's text is empty, and `check_meta_data` can operate normally.
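To make the difference concrete, here is a minimal sketch that runs without llama-index installed. `StubDocument`, the local `MetadataMode`, `count_tokens`, `metadata_fits_buggy` and `metadata_fits_fixed` are all hypothetical names I made up for illustration. The stub only mimics the behavior described above (empty text gives an empty `get_content` result); it is not the real llama-index class, and the two check functions are not the actual `check_meta_data` code from Agentless.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class MetadataMode(Enum):
    # Local stand-in for llama-index's MetadataMode; only EMBED is needed here.
    EMBED = auto()


@dataclass
class StubDocument:
    """Hypothetical stand-in that mimics only the behavior reported in this issue."""

    text: str = ""
    metadata: dict = field(default_factory=dict)

    def get_metadata_str(self, mode=MetadataMode.EMBED):
        # Formats the metadata regardless of whether the text is empty.
        return "\n".join(f"{key}: {value}" for key, value in self.metadata.items())

    def get_content(self, metadata_mode=MetadataMode.EMBED):
        # Reported behavior: with empty text, an empty string comes back.
        if not self.text:
            return ""
        return f"{self.get_metadata_str(metadata_mode)}\n\n{self.text}"


CHUNK_SIZE = 512  # the chunk size shown in the traceback


def count_tokens(s):
    # Crude whitespace tokenizer, an assumption for illustration only.
    return len(s.split())


def metadata_fits_buggy(doc):
    # Blank out the text, then measure get_content(): this measures "".
    probe = StubDocument(text="", metadata=doc.metadata)
    content = probe.get_content(metadata_mode=MetadataMode.EMBED)
    return count_tokens(content) <= CHUNK_SIZE


def metadata_fits_fixed(doc):
    # Measure the metadata string directly, independent of the text.
    metadata_str = doc.get_metadata_str(mode=MetadataMode.EMBED)
    return count_tokens(metadata_str) <= CHUNK_SIZE


long_doc = StubDocument(
    text="def f(): pass",
    metadata={"File Name": " ".join(["tok"] * 734)},
)
short_doc = StubDocument(text="def g(): pass", metadata={"File Name": "a.py"})

print(metadata_fits_buggy(long_doc), metadata_fits_fixed(long_doc))
print(metadata_fits_buggy(short_doc), metadata_fits_fixed(short_doc))
```

In this sketch the buggy check reports that the oversized metadata fits, because it only ever measures an empty string, while the fixed check flags it. Both checks agree on the short document.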

I'm using llama-index 0.14.4. Has anyone else encountered this issue?
