Entity escaping from innerHTML() and html() #49

Open
@technosophos

Description

When getting fragments of a document with html() or innerHTML(), some entities (NBSP, BULL) are not escaped on output, but are left in as UTF-8 character sequences.

This causes other PHP functions (like htmlentities()) to do weird things when the encoding argument is passed in.

Example code:

<?php
$html = "<!DOCTYPE html><html><body>This is a string with a
non&nbsp;breaking space in it</body></html>";
$QP = htmlqp($html, 'html', array('convert_to_encoding' => 'utf-8'));
//$QP = htmlqp($html, 'html');
$QP = qp($html, 'html');

echo '1. ' . htmlentities($QP->html()) . PHP_EOL;
echo '2. ' . htmlentities($QP->top('body')->html()) . PHP_EOL;
echo '3. ' . htmlentities($QP->innerhtml()) . PHP_EOL;

echo '4. ' . htmlentities($QP->top('body')->html(), ENT_COMPAT, 'utf-8') . PHP_EOL;
?>

Only 1 and 4 encode the entities as expected.
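
A minimal sketch of what may be going on in cases 2 and 3, independent of QueryPath. It assumes the fragment output contains NBSP as the raw UTF-8 byte pair 0xC2 0xA0 (as described above) and that htmlentities() is decoding its input as ISO-8859-1, which was its default before PHP 5.4. The variable names are only illustrative.

```php
<?php
// NBSP (U+00A0) as it appears in UTF-8 output: the two bytes 0xC2 0xA0.
$nbsp = "\xC2\xA0";

// Decoded as UTF-8, the two bytes form one character and map to one entity.
$asUtf8 = htmlentities($nbsp, ENT_COMPAT, 'UTF-8');

// Decoded as ISO-8859-1, each byte is treated as a separate character.
$asLatin1 = htmlentities($nbsp, ENT_COMPAT, 'ISO-8859-1');

echo $asUtf8 . PHP_EOL;   // &nbsp;
echo $asLatin1 . PHP_EOL; // &Acirc;&nbsp;
```

If that assumption holds, passing 'utf-8' explicitly (as in case 4) is a workaround on the caller's side, but html() and innerHTML() would still be returning unescaped characters for fragments.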
