ConvertFrom-Html parses special characters as question marks #7

@dominikduennebacke

Hi there. Really appreciate this module using PowerShell Core. Thank you for your work!

While scraping some European websites I came across an issue with special characters such as ü, ä, ö, é and ß.
ConvertFrom-Html does not handle these characters and parses them as question marks. It seems to be related to the encoding, which cannot be specified through any parameter.

Any ideas how to solve this?

Example

The Invoke-WebRequest content shows the "ü" character correctly

$Result = Invoke-WebRequest -Uri "https://www.compart.com/en/unicode/U+00FC"
$Result.Content -split "<" | Where-Object {$_ -like '*span class="box">*'}

>> span class="box">ü

ConvertFrom-Html parses that into "??"

$Html = ConvertFrom-Html -Content $Result
$Html.SelectNodes('//span[@class="box"]')

>> NodeType Name AttributeCount ChildNodeCount ContentLength InnerText
>> -------- ---- -------------- -------------- ------------- ---------
>> Element  span 1              1              2             ??
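
The output above shows two question marks (ContentLength 2) for a single "ü". That would be consistent with the two UTF-8 bytes of "ü" each being decoded as ASCII at some point, although this is only a guess and has not been checked against the PowerHTML source. A minimal sketch of that hypothesis, using plain .NET encoding classes and hypothetical variable names:

```powershell
# Hypothetical check, not taken from PowerHTML: shows what happens when the
# UTF-8 bytes of "ü" are decoded with an ASCII decoder.

# Build "ü" from its code point (U+00FC) so the script file's own encoding does not matter.
$Char = [string][char]0x00FC

# "ü" is two bytes in UTF-8: 0xC3 0xBC.
$Utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($Char)

# The ASCII decoder replaces every byte above 0x7F with "?".
$Decoded = [System.Text.Encoding]::ASCII.GetString($Utf8Bytes)

$Utf8Bytes.Count   # 2
$Decoded           # ??
```

If this is what happens inside the module, any non-ASCII character would come out as one question mark per UTF-8 byte, which matches the "??" seen for "ü".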

The response headers show the correct Content-Type with charset=utf-8

$Result.Headers

>> Key             Value
>> ---             -----
>> Server          {nginx}
>> Date            {Sun, 03 Oct 2021 10:22:56 GMT}
>> Connection      {keep-alive}
>> X-Powered-By    {Express}
>> Accept-Ranges   {bytes}
>> Cache-Control   {public, max-age=0}
>> ETag            {W/"aabd-17a2d88a25f"}
>> X-Response-Time {0}
>> Vary            {Accept-Encoding}
>> Content-Type    {text/html; charset=utf-8}
>> Content-Length  {43709}
>> Last-Modified   {Mon, 21 Jun 2021 07:46:07 GMT}

Version info

$PSVersionTable

>> Name                           Value
>> ----                           -----
>> PSVersion                      7.1.3
>> PSEdition                      Core
>> GitCommitId                    7.1.3
>> OS                             Darwin 20.2.0 Darwin Kernel Version 20.2.0: Wed Dec  2 20:40:21 PST 2020; root:xnu-7195.60.75~1/RELEASE_ARM64_T8101
>> Platform                       Unix
>> PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
>> PSRemotingProtocolVersion      2.3
>> SerializationVersion           1.1.0.1
>> WSManStackVersion              3.0


Get-Module PowerHTML

>> ModuleType Version    PreRelease Name                                ExportedCommands
>> ---------- -------    ---------- ----                                ----------------
>> Script     0.1.7                 PowerHTML                           ConvertFrom-Html
