Keeping images breaks parsing #842

@eugeniobaglieri


I’m testing trafilatura 2.0.0 (CLI and Python) and noticed unexpected behavior with this URL:
https://ricette.giallozafferano.it/Gnocchi-con-crema-di-zucchine.html

Running:

trafilatura -u https://ricette.giallozafferano.it/Gnocchi-con-crema-di-zucchine.html --markdown

This produces a consistent extraction (attached: without_images.md).
However, adding --images changes the output significantly:

trafilatura -u https://ricette.giallozafferano.it/Gnocchi-con-crema-di-zucchine.html --markdown --images

This results in:

  • different and incomplete content extraction
  • Markdown formatting is no longer respected
  • images are still not included in the output

with_images.md
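
For anyone who wants to check this from Python rather than the CLI, here is a minimal sketch of the same comparison. It assumes that `output_format="markdown"` and `include_images=True` in `trafilatura.extract` correspond to the `--markdown` and `--images` flags; I have not verified that the CLI maps to them exactly. `extract_both` and `summarize` are hypothetical helpers written for this report, not part of trafilatura.

```python
import difflib
import re

URL = "https://ricette.giallozafferano.it/Gnocchi-con-crema-di-zucchine.html"

# Matches Markdown image syntax such as ![alt](src)
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def count_headings(text: str) -> int:
    """Count lines that look like Markdown headings (start with '#')."""
    return sum(1 for line in text.splitlines() if line.lstrip().startswith("#"))


def summarize(without_images: str, with_images: str) -> dict:
    """Summarize how two extraction results differ (hypothetical helper)."""
    matcher = difflib.SequenceMatcher(None, without_images, with_images)
    return {
        "chars_without": len(without_images),
        "chars_with": len(with_images),
        "similarity": round(matcher.ratio(), 3),
        "headings_without": count_headings(without_images),
        "headings_with": count_headings(with_images),
        "image_refs": len(IMAGE_PATTERN.findall(with_images)),
    }


def extract_both(url: str) -> tuple:
    """Download the page once and extract it with and without images.

    Assumption: these keyword arguments mirror --markdown and --images.
    """
    import trafilatura  # imported lazily so summarize() works without it

    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        raise RuntimeError(f"download failed for {url}")
    # extract() returns None when nothing is extracted, hence the fallback
    without = trafilatura.extract(downloaded, output_format="markdown") or ""
    with_img = trafilatura.extract(
        downloaded, output_format="markdown", include_images=True
    ) or ""
    return without, with_img
```

Calling `summarize(*extract_both(URL))` should make the symptoms above visible as numbers: a low similarity score, fewer heading lines in the second output, and an `image_refs` count of zero if images are indeed missing.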
