ValueError: attempt to get argmin of an empty sequence when converting PDF with no headings #88

Open
@lobstercare

Converting a PDF document that contains no detectable headings with document_to_markdown raises a ValueError:

ValueError: attempt to get argmin of an empty sequence

This error occurs within the add_heading_level_metadata function in raglite's _markdown.py (see the full traceback further down). It is triggered when the heading_font_sizes array is empty, meaning that no heading font sizes could be determined from the PDF.

As far as I can tell, the code calls np.argmin on an empty NumPy array (heading_font_sizes). This suggests that my PDF either has no detectable headings or that their font sizes are not being recognized correctly.
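A minimal sketch that reproduces the same error outside raglite, assuming heading_font_sizes ends up as an empty NumPy array. The helper name argmin_error_message and the font size 12.0 are hypothetical and only serve the illustration; the expression inside mirrors the failing line in the traceback.

```python
import numpy as np


def argmin_error_message(heading_font_sizes, span_font_size):
    """Return the ValueError message raised by the argmin call, or None if it succeeds."""
    try:
        # Same expression as the failing line in add_heading_level_metadata.
        np.argmin(np.abs(heading_font_sizes - span_font_size))
    except ValueError as exc:
        return str(exc)
    return None


# Assumption: no heading font sizes were detected, so the array is empty.
empty_sizes = np.array([])
print(argmin_error_message(empty_sizes, 12.0))

# With at least one heading font size the call succeeds and nothing is reported.
print(argmin_error_message(np.array([24.0, 18.0]), 12.0))
```

Subtracting a scalar from an empty array and taking np.abs both succeed and return another empty array, so the exception only surfaces at np.argmin.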

Here's the PDF file:

groovy.pdf

------------- Full error --------------------------------------------------


ValueError Traceback (most recent call last)
in <cell line: 0>()
1 doc_path=Path(file_path)
----> 2 doc=document_to_markdown(doc_path)
3 doc

4 frames
/usr/local/lib/python3.11/dist-packages/raglite/_markdown.py in document_to_markdown(doc_path)
202 # Parse the PDF with pdftext and convert it to Markdown.
203 pages = dictionary_output(doc_path, sort=True, keep_chars=False)
--> 204 doc = "\n\n".join(parsed_pdf_to_markdown(pages))
205 else:
206 try:

/usr/local/lib/python3.11/dist-packages/raglite/_markdown.py in parsed_pdf_to_markdown(pages)
184
185 # Add heading level metadata.
--> 186 pages = add_heading_level_metadata(pages)
187 # Add emphasis metadata.
188 pages = add_emphasis_metadata(pages)

/usr/local/lib/python3.11/dist-packages/raglite/_markdown.py in add_heading_level_metadata(pages)
75 idx = 6
76 else:
---> 77 idx = np.argmin(np.abs(heading_font_sizes - span_font_size)) # type: ignore[assignment]
78 span["md"]["heading_level"] = idx + 1
79 heading_level[idx] += len(span["text"])

/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in argmin(a, axis, out, keepdims)
1323 """
1324 kwds = {'keepdims': keepdims} if keepdims is not np._NoValue else {}
-> 1325 return _wrapfunc(a, 'argmin', axis=axis, out=out, **kwds)
1326
1327

/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
57
58 try:
---> 59 return bound(*args, **kwds)
60 except TypeError:
61 # A TypeError occurs if the object does have such a method in its

ValueError: attempt to get argmin of an empty sequence
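One possible guard, sketched under assumptions: this is not raglite's actual code or fix, and the helper name closest_heading_index is hypothetical. It returns None when no heading font sizes are available, so the caller could treat the span as regular text instead of calling np.argmin on an empty array.

```python
import numpy as np


def closest_heading_index(heading_font_sizes, span_font_size):
    """Index of the heading font size closest to span_font_size, or None if there are none."""
    sizes = np.asarray(heading_font_sizes, dtype=float)
    if sizes.size == 0:
        # No heading font sizes detected: there is nothing to match against.
        return None
    return int(np.argmin(np.abs(sizes - span_font_size)))


# Empty input no longer raises.
print(closest_heading_index([], 12.0))

# With detected sizes, the closest one is selected (18.0 is nearest to 17.0).
print(closest_heading_index([24.0, 18.0, 14.0], 17.0))
```

Whether a span without a matching heading size should be skipped or assigned a default level is a design decision for the maintainers; the sketch only shows where the empty case can be caught.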
