Description
Describe the bug
Passing unstructured.cleaners.core.group_bullet_paragraph in UnstructuredBaseLoader's post_processors raises a ValueError: group_bullet_paragraph returns a List[str], and the unstructured.documents.elements.Text.apply() method checks the cleaner's output and raises if it is not a str. See here:
if not isinstance(cleaned_text, str):  # pyright: ignore[reportUnnecessaryIsInstance]
    raise ValueError("Cleaner produced a non-string output.")
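To make the failure mode concrete without installing anything, here is a minimal sketch of the mechanism. `FakeText` and `fake_group_bullets` are hypothetical, simplified stand-ins written for this illustration; they are not the library's code and only mimic the behaviour described above (a cleaner that returns a list, followed by the quoted `isinstance` check).

```python
from typing import Callable, List


def fake_group_bullets(text: str) -> List[str]:
    """Hypothetical stand-in for group_bullet_paragraph: one string per bullet."""
    return [part.strip() for part in text.split("•") if part.strip()]


class FakeText:
    """Hypothetical, simplified stand-in for unstructured's Text element."""

    def __init__(self, text: str) -> None:
        self.text = text

    def apply(self, *cleaners: Callable) -> None:
        cleaned_text = self.text
        for cleaner in cleaners:
            cleaned_text = cleaner(cleaned_text)
        # Same check as quoted above: anything other than str is rejected.
        if not isinstance(cleaned_text, str):
            raise ValueError("Cleaner produced a non-string output.")
        self.text = cleaned_text


element = FakeText("• first point • second point")
try:
    element.apply(fake_group_bullets)
    outcome = "ok"
except ValueError as exc:
    outcome = str(exc)

print(outcome)
```

In this sketch the cleaner itself succeeds; the error comes only from the type check that runs afterwards, which matches the behaviour reported here.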
To Reproduce
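# Import path is an assumption; older LangChain versions expose the loader
# from langchain.document_loaders instead.
from langchain_community.document_loaders import UnstructuredFileLoader
from unstructured.cleaners.core import group_bullet_paragraph

loader = UnstructuredFileLoader(
    "some_file_that_has_bullet_points.pdf",
    mode="elements",
    pdf_infer_table_structure=True,
    skip_infer_table_types=["jpg", "png", "xls", "xlsx"],
    show_progress=True,
    post_processors=[group_bullet_paragraph],  # returns List[str], which triggers the error
)
docs = loader.load()  # raises ValueError: Cleaner produced a non-string output.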
Expected behavior
The list of strings should be joined into a single string instead of being rejected.
I propose replacing:
if not isinstance(cleaned_text, str):  # pyright: ignore[reportUnnecessaryIsInstance]
    raise ValueError("Cleaner produced a non-string output.")
with something like:
if isinstance(cleaned_text, list):
    cleaned_text = " ".join(cleaned_text)
if not isinstance(cleaned_text, str):  # pyright: ignore[reportUnnecessaryIsInstance]
    raise ValueError("Cleaner produced a non-string output.")
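Until something like this lands in the library, a user-side wrapper might serve as a workaround. The sketch below is only an illustration: `join_list_output` is a hypothetical helper name, and `fake_group_bullets` is a stand-in for group_bullet_paragraph so the example runs on its own. It assumes the loader simply calls each post-processor on the element text, as described in this issue.

```python
from typing import Callable, List, Union


def join_list_output(
    cleaner: Callable[[str], Union[str, List[str]]], sep: str = " "
) -> Callable[[str], str]:
    """Wrap a cleaner so that a list-of-strings result is joined into one string."""

    def wrapped(text: str) -> str:
        result = cleaner(text)
        if isinstance(result, list):
            return sep.join(result)
        return result  # already a str: pass it through unchanged

    return wrapped


def fake_group_bullets(text: str) -> List[str]:
    """Hypothetical stand-in for group_bullet_paragraph: one string per bullet."""
    return [part.strip() for part in text.split("•") if part.strip()]


safe_cleaner = join_list_output(fake_group_bullets)
joined = safe_cleaner("• first point • second point")
print(joined)  # first point second point
```

With the real library this would presumably be passed as post_processors=[join_list_output(group_bullet_paragraph)]. The separator is kept as a parameter because a single space (as in the proposal above) loses the bullet boundaries, and a newline may be preferable for some documents.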