Description
When attempting to convert a PDF document that doesn't contain any detectable headings using document_to_markdown, a ValueError is raised:
ValueError: attempt to get argmin of an empty sequence
This error occurs within the add_heading_level_metadata function in raglite's _markdown.py, which document_to_markdown calls through parsed_pdf_to_markdown. It's triggered when the heading_font_sizes array is empty, meaning that no heading font sizes could be determined from the PDF.
My guess is that np.argmin is being called on an empty NumPy array (heading_font_sizes). This suggests that my PDF file either has no detectable headings or has headings whose font sizes are not being recognized correctly.
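To illustrate the suspected cause, here is a minimal sketch that reproduces the same NumPy behavior outside raglite. The variable names mirror the traceback, but the values and the helper function are made up for illustration and are not taken from raglite itself:

```python
import numpy as np

# Hypothetical stand-ins for the values inside add_heading_level_metadata.
heading_font_sizes = np.array([])  # no heading font sizes were detected
span_font_size = 12.0              # arbitrary example font size


def closest_heading_index(font_sizes, size):
    """Index of the entry in font_sizes closest to size.

    Uses the same expression as the failing line in the traceback.
    """
    return int(np.argmin(np.abs(font_sizes - size)))


error_message = None
try:
    closest_heading_index(heading_font_sizes, span_font_size)
except ValueError as exc:
    # np.argmin refuses to operate on an empty array.
    error_message = str(exc)
    print(error_message)
```

With a non-empty array such as np.array([24.0, 18.0, 14.0]) the same helper returns an index normally, so the failure appears to depend only on heading_font_sizes being empty.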
Here's the PDF file:
------------- Full error --------------------------------------------------
ValueError Traceback (most recent call last)
in <cell line: 0>()
1 doc_path=Path(file_path)
----> 2 doc=document_to_markdown(doc_path)
3 doc
/usr/local/lib/python3.11/dist-packages/raglite/_markdown.py in document_to_markdown(doc_path)
202 # Parse the PDF with pdftext and convert it to Markdown.
203 pages = dictionary_output(doc_path, sort=True, keep_chars=False)
--> 204 doc = "\n\n".join(parsed_pdf_to_markdown(pages))
205 else:
206 try:
/usr/local/lib/python3.11/dist-packages/raglite/_markdown.py in parsed_pdf_to_markdown(pages)
184
185 # Add heading level metadata.
--> 186 pages = add_heading_level_metadata(pages)
187 # Add emphasis metadata.
188 pages = add_emphasis_metadata(pages)
/usr/local/lib/python3.11/dist-packages/raglite/_markdown.py in add_heading_level_metadata(pages)
75 idx = 6
76 else:
---> 77 idx = np.argmin(np.abs(heading_font_sizes - span_font_size)) # type: ignore[assignment]
78 span["md"]["heading_level"] = idx + 1
79 heading_level[idx] += len(span["text"])
/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in argmin(a, axis, out, keepdims)
1323 """
1324 kwds = {'keepdims': keepdims} if keepdims is not np._NoValue else {}
-> 1325 return _wrapfunc(a, 'argmin', axis=axis, out=out, **kwds)
1326
1327
/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
57
58 try:
---> 59 return bound(*args, **kwds)
60 except TypeError:
61 # A TypeError occurs if the object does have such a method in its
ValueError: attempt to get argmin of an empty sequence
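
One possible way to avoid the crash would be to skip the argmin lookup when no heading font sizes were found. The sketch below is only an assumption about how such a guard could look: nearest_heading_level is a hypothetical helper, not a raglite function, and returning None for "no heading level" is my own choice rather than the library's behavior.

```python
import numpy as np


def nearest_heading_level(heading_font_sizes, span_font_size):
    """Return a 1-based heading level, or None if no heading sizes exist.

    Hypothetical helper for illustration; not part of raglite.
    """
    sizes = np.asarray(heading_font_sizes, dtype=float)
    if sizes.size == 0:
        # Nothing to compare against, so do not assign a heading level.
        return None
    idx = int(np.argmin(np.abs(sizes - span_font_size)))
    return idx + 1
```

For example, nearest_heading_level([], 12.0) returns None instead of raising, while nearest_heading_level([24.0, 18.0, 14.0], 17.0) returns 2 because 18.0 is the closest size.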