bug/element type for non-English languages #3044

Open
@cm-halfspace

Describe the bug
When I partition a Danish .docx file, some elements are assigned unexpected element types.

I think this happens because the languages list is not passed along in _parse_paragraph_text_for_element_type, e.g. in the call is_possible_narrative_text(text), so the English-specific checks still run on non-English text.
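To illustrate the suspected cause, here is a minimal, self-contained sketch that does not import unstructured. The names `check_text`, `parse_dropping_languages` and `parse_forwarding_languages` are hypothetical stand-ins, and the `("eng",)` default is an assumption about what the check falls back to when no language list is passed.

```python
from typing import Sequence


def check_text(text: str, languages: Sequence[str] = ("eng",)) -> str:
    """Hypothetical stand-in for a language-aware check."""
    # Assumption: the default language list contains "eng".
    return "english-rules" if "eng" in languages else "language-neutral"


def parse_dropping_languages(text: str, languages: Sequence[str]) -> str:
    # The caller knows the document language but does not forward it,
    # so the check silently falls back to its default.
    return check_text(text)


def parse_forwarding_languages(text: str, languages: Sequence[str]) -> str:
    # Forwarding the argument lets the check skip English-specific rules.
    return check_text(text, languages=languages)


danish = "Dette er et eksempel på en kort sætning."
print(parse_dropping_languages(danish, ["dan"]))    # english-rules
print(parse_forwarding_languages(danish, ["dan"]))  # language-neutral
```

If this is what happens in the library, forwarding the language list would address the root cause, while the quick fix only disables the English-specific check when language_checks is off.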

Looking at the definition of is_possible_narrative_text, a quick temporary fix would be to also require language_checks in the condition on line 90, so that it becomes:

if "eng" in languages and language_checks and (sentence_count(text, 3) < 2) and (not contains_verb(text)):
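The effect of that extra condition can be sketched without the library. In the snippet below, `fake_sentence_count` and `fake_contains_verb` are hypothetical stand-ins for the real helpers (the verb stand-in only recognises a few English verbs, roughly mimicking an English-only tagger), and `verb_check_rejects` is an illustrative name, not part of unstructured.

```python
import re
from typing import Sequence

# Hypothetical stand-in data: a handful of English verbs.
ENGLISH_VERBS = {"is", "are", "was", "were", "has", "have"}


def fake_contains_verb(text: str) -> bool:
    words = (w.strip(".,!?").lower() for w in text.split())
    return any(w in ENGLISH_VERBS for w in words)


def fake_sentence_count(text: str, min_length: int) -> int:
    # Count sentences that have at least `min_length` words.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(1 for s in sentences if len(s.split()) >= min_length)


def verb_check_rejects(
    text: str, languages: Sequence[str], language_checks: bool, gated: bool
) -> bool:
    """Return True when the short-text verb check would reject `text`."""
    applies = "eng" in languages
    if gated:
        # Proposed quick fix: only run the check when language_checks is on.
        applies = applies and language_checks
    return (
        applies
        and fake_sentence_count(text, 3) < 2
        and not fake_contains_verb(text)
    )


danish = "Dette er et eksempel på en kort sætning."
print(verb_check_rejects(danish, ["eng"], False, gated=False))  # True
print(verb_check_rejects(danish, ["eng"], False, gated=True))   # False
```

With the gate in place the Danish sentence is no longer rejected by this check, at the cost of also skipping it for English text whenever language_checks is off.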

To Reproduce

from unstructured.partition.text_type import is_possible_narrative_text

# A short Danish sentence ("This is an example of a short sentence.")
text = "Dette er et eksempel på en kort sætning."
print(is_possible_narrative_text(text))

The call currently returns False. With the quick fix above, it would return True, as expected.

Labels: bug (Something isn't working), element (Related to document element schema and classification)
