Description
I'm trying to crawl www.professeurphifix.net and I have an issue with embedded PDFs.
Let's focus on https://www.professeurphifix.net/orthographe_impression/ortho_a_1.html as an example.
The markup that embeds the PDF is:
<embed src="ortho_a_1.pdf" width="680px" height="600px">
Since this is an embed element rather than a link, the crawler does not explore it by default, but this is not a big deal thanks to the "recent" --selectLinks setting ;)
Command used:
crawl --scopeIncludeRx ortho_a_1 --selectLinks "a[href]->href,embed[src]->src" --seeds https://www.professeurphifix.net/orthographe_impression/ortho_a_1.html
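To make clear what I expect the `embed[src]->src` selector to pick up, here is a minimal sketch in Python. It is only an illustration using the standard library, not how the crawler implements `--selectLinks`; the class `EmbedSrcCollector` and the function `extract_embed_links` are hypothetical names made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Page URL and markup taken from the example above.
PAGE_URL = "https://www.professeurphifix.net/orthographe_impression/ortho_a_1.html"
HTML = '<embed src="ortho_a_1.pdf" width="680px" height="600px">'


class EmbedSrcCollector(HTMLParser):
    """Collects the src attribute of every <embed> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Rough equivalent of the selector "embed[src]->src".
        if tag == "embed":
            src = dict(attrs).get("src")
            if src:
                # Resolve the relative path against the page URL.
                self.links.append(urljoin(self.base_url, src))


def extract_embed_links(html, base_url):
    """Hypothetical helper: return absolute URLs of embedded resources."""
    collector = EmbedSrcCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    for link in extract_embed_links(HTML, PAGE_URL):
        print(link)
```

The relative `src` resolves to a URL in the same directory as the page, and that URL contains `ortho_a_1`, so it also matches the `--scopeIncludeRx` value used in the command.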
With this "tweak", the resulting WARC contains the PDF, but "something" seems to prevent it from being displayed on replayweb.page (and, obviously, in the ZIM as well).
Am I missing something? Or is this rather a wombat.js issue?
Sample WARC with the HTML and the PDF:
rec-da74c0c8fc0b-20250328092919995-0.warc.gz