
Handling special characters #16

Open
@pvcastro


Hi @ines, how are you?

How can I handle special characters such as accented letters, ª, º, ç, etc.? Some of the Portuguese PDFs I'm processing contain many of these characters, and the text extracted from them comes out with errors such as:

Na forma de jurisprudncia **(should be jurisprudência)** do Superior Tribunal de Justiça - AgRg no REsp 1269246/RS, Rel. Ministro Luis Felipe Salomªo **(should be Salomão)** -, danos morais in re ipsa, em casos de atraso de voos somente sªo **(should be são)** constatados em tempo de demora superior a oito (08) horas.

No caso concreto, o tempo foi de cerca de cinco horas e, nªo havendo circunstâncias extraordinÆrias **(should be extraordinárias)**, excluem-se os danos morais.

Juiz JosØ **(should be José)** ...

5052582-09.2020.8.09.0051-212510970_Voto.pdf

Code is simple:

import spacy
from spacy_layout import spaCyLayout

nlp = spacy.load('pt_core_news_lg')
layout = spaCyLayout(nlp)
doc = layout(sample_path)  # sample_path: path to the PDF being processed
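
To gauge how widespread the problem is in a given document, the extracted text can be scanned for characters that should not appear inside Portuguese words. The following is only a minimal diagnostic sketch, not a fix and not part of spaCy Layout: the helper name `suspect_words` and the list of accepted accented letters are assumptions made for illustration. It also cannot detect letters that were dropped entirely (as in "jurisprudncia").

```python
import re

# Accented letters assumed to be legitimate in Portuguese text (illustrative list).
PT_ACCENTED = set("áàâãéêíóôõúüçÁÀÂÃÉÊÍÓÔÕÚÜÇ")
# Ordinal indicators are valid on their own (e.g. "1ª"), but not inside a word.
ORDINALS = set("ªº")


def suspect_words(text):
    """Return words containing characters that are unlikely in Portuguese."""
    flagged = []
    # [^\W\d_]+ matches runs of Unicode letters (no digits, no underscore).
    for word in re.findall(r"[^\W\d_]+", text):
        for ch in word:
            if ch.isascii() or ch in PT_ACCENTED:
                continue
            if ch in ORDINALS and len(word) == 1:
                continue  # standalone ordinal indicator, e.g. after a digit
            flagged.append(word)
            break
    return flagged


sample = (
    "Ministro Luis Felipe Salomªo, nªo havendo circunstâncias "
    "extraordinÆrias. Juiz JosØ, 1ª Turma."
)
print(suspect_words(sample))
# ['Salomªo', 'nªo', 'extraordinÆrias', 'JosØ']
```

The same function could be applied to `doc.text` from the snippet above to list the affected words in a processed PDF.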

Thanks!


Labels: docling, enhancement
