Issue with embed PDFs on www.professeurphifix.net #801

Open
@benoit74

I'm trying to crawl www.professeurphifix.net and I have an issue with embedded PDFs.

Let's focus on https://www.professeurphifix.net/orthographe_impression/ortho_a_1.html as an example.

The code that displays the PDF is:

<embed src="ortho_a_1.pdf" width="680px" height="600px">

The crawler therefore does not follow it by default, but this is not a big deal thanks to the "recent" --selectLinks setting ;)

Command used:

crawl --scopeIncludeRx ortho_a_1 --selectLinks "a[href]->href,embed[src]->src" --seeds https://www.professeurphifix.net/orthographe_impression/ortho_a_1.html
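
As a rough illustration of what the two `selector->attribute` rules and the scope regex in this command are expected to do for this page, here is a minimal Python sketch. It is not the crawler's actual implementation: the helper names (`parse_rules`, `LinkCollector`, `extract_links`) are hypothetical, and it only understands simple `tag[attr]` selectors like the two used above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Values taken from the issue; everything else below is an illustrative assumption.
PAGE_URL = "https://www.professeurphifix.net/orthographe_impression/ortho_a_1.html"
HTML = '<embed src="ortho_a_1.pdf" width="680px" height="600px">'
SELECT_LINKS = "a[href]->href,embed[src]->src"
SCOPE_INCLUDE_RX = "ortho_a_1"


def parse_rules(spec):
    """Split 'tag[attr]->attr' rules into (tag, required_attr, extract_attr) tuples."""
    rules = []
    for rule in spec.split(","):
        selector, extract = rule.split("->")
        match = re.fullmatch(r"(\w+)\[(\w+)\]", selector.strip())
        if match is None:
            raise ValueError(f"unsupported selector in this sketch: {selector!r}")
        rules.append((match.group(1), match.group(2), extract.strip()))
    return rules


class LinkCollector(HTMLParser):
    """Collect absolute URLs from start tags that match one of the rules."""

    def __init__(self, rules, base_url):
        super().__init__()
        self.rules = rules
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        for rule_tag, required, extract in self.rules:
            if tag == rule_tag and required in attr_map and attr_map.get(extract):
                # Resolve relative values such as "ortho_a_1.pdf" against the page URL.
                self.links.append(urljoin(self.base_url, attr_map[extract]))


def extract_links(html, spec, base_url):
    collector = LinkCollector(parse_rules(spec), base_url)
    collector.feed(html)
    collector.close()
    return collector.links


links = extract_links(HTML, SELECT_LINKS, PAGE_URL)
# Keep only the URLs that match the include regex.
in_scope = [url for url in links if re.search(SCOPE_INCLUDE_RX, url)]
print(in_scope)
# → ['https://www.professeurphifix.net/orthographe_impression/ortho_a_1.pdf']
```

With only the `a[href]->href` rule, the same sketch returns no links for this snippet, which matches the observation that the PDF is not explored without the extra `embed[src]->src` rule.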

With this "tweak", the resulting WARC contains the PDF, but "something" seems to prevent it from being displayed on replayweb.page (and, obviously, in the ZIM as well).

Am I missing something? Or is this rather a wombat.js issue?

Sample WARC with the HTML and the PDF:
rec-da74c0c8fc0b-20250328092919995-0.warc.gz
