Describe the bug
Passing unstructured.cleaners.core.group_bullet_paragraph in UnstructuredBaseLoader's post_processors raises an error. group_bullet_paragraph returns a List[str], but unstructured.documents.elements.Text.apply() checks the cleaner output and raises a ValueError if it is not a str, see here:
if not isinstance(cleaned_text, str):  # pyright: ignore[reportUnnecessaryIsInstance]
    raise ValueError("Cleaner produced a non-string output.")

To Reproduce
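# Assumed import path; older LangChain versions expose this under langchain.document_loaders.
from langchain_community.document_loaders import UnstructuredFileLoader
from unstructured.cleaners.core import group_bullet_paragraph

loader = UnstructuredFileLoader(
    "some_file_that_has_bullet_points.pdf",
    mode="elements",
    pdf_infer_table_structure=True,
    skip_infer_table_types=['jpg', 'png', 'xls', 'xlsx'],
    show_progress=True,
    post_processors=[group_bullet_paragraph],  # returns List[str], which triggers the ValueError
)
docs = loader.load()

Expected behavior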
The list of strings should be joined into a single string instead of raising an error.
I propose replacing:
if not isinstance(cleaned_text, str):  # pyright: ignore[reportUnnecessaryIsInstance]
    raise ValueError("Cleaner produced a non-string output.")

with something like:
if isinstance(cleaned_text, list):
    cleaned_text = " ".join(cleaned_text)
if not isinstance(cleaned_text, str):  # pyright: ignore[reportUnnecessaryIsInstance]
    raise ValueError("Cleaner produced a non-string output.")
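
To make the proposal concrete without needing the library installed, here is a minimal, self-contained sketch. Note that apply_cleaners is a hypothetical stand-in for the check in Text.apply(), not the library's actual implementation, and split_on_bullets is a hypothetical list-returning cleaner standing in for group_bullet_paragraph. join_output is a possible workaround that could be used today, assuming post_processors accepts any callable that takes and returns a str.

```python
from typing import Callable, List, Union

CleanerOutput = Union[str, List[str]]


def split_on_bullets(text: str) -> List[str]:
    """Hypothetical list-returning cleaner: one entry per '•' bullet."""
    return [part.strip() for part in text.split("•") if part.strip()]


def apply_cleaners(text: str, *cleaners: Callable[[str], CleanerOutput]) -> str:
    """Hypothetical stand-in for the cleaner check, with the proposed join added."""
    cleaned_text: CleanerOutput = text
    for cleaner in cleaners:
        cleaned_text = cleaner(cleaned_text)
        # Proposed change: collapse a list of strings into one string.
        if isinstance(cleaned_text, list):
            cleaned_text = " ".join(cleaned_text)
        # Existing behavior: anything else that is not a str is rejected.
        if not isinstance(cleaned_text, str):
            raise ValueError("Cleaner produced a non-string output.")
    return cleaned_text


def join_output(
    cleaner: Callable[[str], CleanerOutput], sep: str = " "
) -> Callable[[str], str]:
    """Workaround: wrap a list-returning cleaner so it returns a single str."""

    def wrapped(text: str) -> str:
        result = cleaner(text)
        return sep.join(result) if isinstance(result, list) else result

    return wrapped


print(apply_cleaners("• first item • second item", split_on_bullets))
print(join_output(split_on_bullets)("• first item • second item"))
```

With a wrapper like this, the reproduction above could presumably pass post_processors=[join_output(group_bullet_paragraph)] until the check itself is changed. This sketch joins inside the loop so that later cleaners always receive a str; the real method may run its check at a different point.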