Description
Bug
I am using the minimal VLM pipeline to generate markdown from a PDF.
I see that the model is altering the text. Is this a known limitation or a bug?
I used only page 2 from https://arxiv.org/pdf/2408.09869v5.pdf
ExtractPage_code.pdf
In the PDF there is this sentence:
"All required model assets are downloaded to a local huggingface datasets cache on first use, unless you choose to pre-install the model assets in advance."
In the markdown it reads like this:
"All required models are downloaded to a local huggingface dataset once you have downloaded the package."
It even made up a whole paragraph (section 4 in the output):
## 4 Document generation
Docling generates a graph from the document. The graph is then used to construct a DCGL (Dedicated Graph-Cognitive Language Lenguaging) pipeline, which extracts the nodes from the document and transforms them into a graph. The graph is then used to construct a DCGL (Dedicated Graph-Cognitive Language Lenguaging) pipeline, which extracts the nodes from the document and transforms them into a graph.
The resulting markdown is here:
output_ExtractPage_code.md
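To make the discrepancy easier to see, here is a minimal sketch that diffs the two sentences quoted above word by word using only the Python standard library. The helper name `word_diff` is hypothetical and not part of Docling; it only illustrates one way to compare the PDF text against the generated markdown.

```python
import difflib

# Sentences copied from this report: the PDF text and the generated markdown.
pdf_sentence = (
    "All required model assets are downloaded to a local huggingface datasets "
    "cache on first use, unless you choose to pre-install the model assets in advance."
)
markdown_sentence = (
    "All required models are downloaded to a local huggingface dataset "
    "once you have downloaded the package."
)


def word_diff(expected: str, actual: str) -> tuple[float, list[str]]:
    """Return a similarity ratio and the word-level changes between two strings.

    Hypothetical helper for illustration only.
    """
    expected_words = expected.split()
    actual_words = actual.split()
    matcher = difflib.SequenceMatcher(
        None, expected_words, actual_words, autojunk=False
    )
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            before = " ".join(expected_words[i1:i2])
            after = " ".join(actual_words[j1:j2])
            changes.append(f"{tag}: {before!r} -> {after!r}")
    return matcher.ratio(), changes


ratio, changes = word_diff(pdf_sentence, markdown_sentence)
print(f"similarity: {ratio:.2f}")
for change in changes:
    print(change)
```

A ratio below 1.0 together with a non-empty change list shows that the sentence was rewritten rather than transcribed. The same check could be run against text extracted from the PDF's text layer, assuming the page has one.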
...
Steps to reproduce
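from pathlib import Path

from docling.datamodel import vlm_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling_core.types.doc import ImageRefMode

source = Path("ExtractPage_code.pdf")
# create directory for outputs
output_dir = Path("outputs")
output_dir.mkdir(exist_ok=True)
vlm_pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,  # <-- change the model here
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=vlm_pipeline_options,
        )
    }
)
result = converter.convert(source)
# save the converted document as markdown
output_path = output_dir / f"output_{source.stem}"
result.document.save_as_markdown(
    output_path.with_suffix(".md"),
    image_mode=ImageRefMode.PLACEHOLDER,
    include_annotations=True,
)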
...
Docling version
Docling version: 2.55.1
Docling Core version: 2.48.4
Docling IBM Models version: 3.9.1
Docling Parse version: 4.5.0
Python: cpython-311 (3.11.9)
Platform: Windows-10-10.0.22631-SP0
...
Python version
Python 3.11.9
...