Ignore invalid HTML (or self closed tags) #251

Open
@titospeakap

Up to version 2.2.0, the following HTML is parsed in full:

<H1>Heading 1</h1>
<p>Paragraph
<b>Second</b> line.</p>
<ul><li>List item 1</li><li>List item 2<ul><li>List item 2.1</li><li>List item 2.2</li></ul></li><li>List item 3</ul>
<p>Paragraph 2</p>
<h2>Heading 2</h2>
<p>Paragraph 3</p>
<p><img alt="image" width="100" height="20"></p>
<audio />
<video />
<p><a data-rel="attachment">attachment</a></p>
<p>Another paragraph. <a href="http://url.to.link">Hyperlink</a>.</p>
<ol><li>List item 1</li><li>List item 2<ol><li>List item 2.1</li><li>List item 2.2</li></ol></li><li>List item 3</ol>

In more recent versions, parsing stops at the <audio /> tag (if I change it to <audio></audio>, it works), but no errors are reported: ->hasErrors() returns false.

Is this behaviour intentional? And is there a way, in more recent versions, to replicate what happens in version 2.2.0 and below?

For the HTML shared above, here is the code I'm running:

$html5 = new HTML5();
// loadHTMLFragment() returns the parsed fragment; keep it so it can be iterated
$fragment = $html5->loadHTMLFragment($html);
foreach ($fragment->childNodes as $child) {
    echo $child->nodeName . "\n";
}

And the respective output in version 2.9.0:

h1
#text
p
#text
ul
#text
p
#text
h2
#text
p
#text
p
#text
audio

With version 2.2.0, however, I get:

h1
#text
p
#text
ul
#text
p
#text
h2
#text
p
#text
p
#text
audio
#text
video
#text
p
#text
p
#text
ol
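
Since <audio></audio> parses correctly while <audio /> does not, one possible stopgap is to expand self-closed non-void tags before handing the markup to the parser. The following is only a rough sketch of that idea: expandSelfClosedTags() is a hypothetical helper, not part of the library, and the default tag list (audio, video) is an assumption based on the sample above. It uses a regular expression, so it will misfire on attribute values that contain a ">" character.

```php
<?php
// Hypothetical helper (not part of the library): rewrites self-closed,
// non-void elements such as <audio /> into <audio></audio>.
// The default tag list is an assumption taken from the sample HTML.
function expandSelfClosedTags(string $html, array $tags = ['audio', 'video']): string
{
    // Build an alternation of tag names, e.g. "audio|video".
    $names = implode('|', array_map('preg_quote', $tags));

    // Match "<tag ... />" case-insensitively and keep any attributes.
    // Limitation: an attribute value containing ">" will not match correctly.
    $pattern = '~<(' . $names . ')\b([^>]*?)\s*/>~i';

    return preg_replace($pattern, '<$1$2></$1>', $html);
}
```

The result could then be passed to loadHTMLFragment() in place of the original $html. This only sidesteps the symptom; it does not answer whether the change in parsing behaviour is intended.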
