Describe the bug
When I partition a Danish .docx file, I notice some odd classifications of the element types.
I think this is because the languages list is not being set in _parse_paragraph_text_for_element_type, e.g. in the call is_possible_narrative_text(text).
Looking at the definition of is_possible_narrative_text, a quick temporary fix would be to at least take language_checks into account on line 90, so that the condition becomes:
if "eng" in languages and language_checks and (sentence_count(text, 3) < 2) and (not contains_verb(text)):
To Reproduce
from unstructured.partition.text_type import is_possible_narrative_text
text = "Dette er et eksempel på en kort sætning."
is_possible_narrative_text(text)
This currently returns False. With the quick fix above, it would return True, as expected.
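
To illustrate why gating the condition on language_checks changes the result, here is a minimal, self-contained sketch. It does not use the library's code: contains_verb and sentence_count below are simplified stand-ins (a tiny English verb list and a regex sentence splitter), and narrative_check with its gate_on_language_checks flag is a hypothetical helper that only mirrors the single condition discussed above. The defaults languages=("eng",) and language_checks=False are assumed.

```python
import re

# Hypothetical stand-in for an English-only verb detector (not the library's).
ENGLISH_VERBS = {"is", "are", "was", "were", "has", "have", "do", "does"}


def contains_verb(text):
    """Return True if any word is in the small English verb list."""
    words = re.findall(r"\w+", text.lower())
    return any(word in ENGLISH_VERBS for word in words)


def sentence_count(text, min_length=None):
    """Count sentences, optionally only those with at least min_length words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if min_length is None:
        return len(sentences)
    return sum(1 for s in sentences if len(s.split()) >= min_length)


def narrative_check(text, languages=("eng",), language_checks=False,
                    gate_on_language_checks=False):
    """Mirror only the condition from the issue; all other checks are omitted."""
    apply_english_rule = "eng" in languages
    if gate_on_language_checks:
        # Proposed quick fix: only run the English rule when language_checks is set.
        apply_english_rule = apply_english_rule and language_checks
    if apply_english_rule and sentence_count(text, 3) < 2 and not contains_verb(text):
        return False
    return True


text = "Dette er et eksempel på en kort sætning."
current = narrative_check(text)                                # English rule applied
patched = narrative_check(text, gate_on_language_checks=True)  # English rule skipped
print(current, patched)
```

In this sketch the Danish sentence fails the English-only verb check under the current condition, while the gated condition skips that check whenever language_checks is left at its default.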