I’m testing trafilatura 2.0.0 (CLI and Python) and noticed unexpected behavior with this URL:
https://ricette.giallozafferano.it/Gnocchi-con-crema-di-zucchine.html
Running:
trafilatura -u https://ricette.giallozafferano.it/Gnocchi-con-crema-di-zucchine.html --markdown
produces a consistent extraction: without_images.md
However, adding --images changes the output significantly:
trafilatura -u https://ricette.giallozafferano.it/Gnocchi-con-crema-di-zucchine.html --markdown --images
This results in:
- different and incomplete content extraction
- Markdown formatting that is no longer respected
- images that are still not included
with_images.md
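For the Python side, a minimal reproduction sketch could look like the one below. It assumes that `--markdown` corresponds to `output_format="markdown"` and `--images` to `include_images=True` in `trafilatura.extract`; I have not confirmed that the CLI and the Python API follow exactly the same code path. The helpers `extract_markdown` and `diff_outputs` are names made up for this sketch, not part of trafilatura.

```python
import difflib

URL = "https://ricette.giallozafferano.it/Gnocchi-con-crema-di-zucchine.html"


def extract_markdown(html, include_images):
    """Extract Markdown from an HTML string, with or without images."""
    # Imported lazily so the diff helper below works without trafilatura installed.
    from trafilatura import extract

    # Assumption: these arguments mirror the CLI flags --markdown and --images.
    result = extract(html, output_format="markdown", include_images=include_images)
    return result or ""


def diff_outputs(without_images, with_images):
    """Return the unified-diff lines between the two extractions."""
    return list(
        difflib.unified_diff(
            without_images.splitlines(),
            with_images.splitlines(),
            fromfile="without_images.md",
            tofile="with_images.md",
            lineterm="",
        )
    )


if __name__ == "__main__":
    from trafilatura import fetch_url

    html = fetch_url(URL)
    if html is None:
        raise SystemExit("download failed")

    plain = extract_markdown(html, include_images=False)
    with_imgs = extract_markdown(html, include_images=True)

    # Show exactly which lines differ between the two runs.
    print("\n".join(diff_outputs(plain, with_imgs)))
    # Rough check for Markdown image syntax in the second output.
    print("image markup present:", "![" in with_imgs)
```

Downloading the page once and running both extractions on the same HTML string rules out differences caused by the server returning different content on each request.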