Open
Labels: enhancement, good first issue
Description
Hi there. Really appreciate this module using PowerShell Core. Thank you for your work!
While scraping some European websites, I came across an issue with non-ASCII characters such as ü, ä, ö, é, and ß.
ConvertFrom-Html does not handle these characters and returns them as question marks. This seems to be related to the encoding, which cannot be specified through any parameter.
Any ideas how to solve this?
Example
The Invoke-WebRequest content shows the "ü" character correctly:
$Result = Invoke-WebRequest -Uri "https://www.compart.com/en/unicode/U+00FC"
$Result.Content -split "<" | Where-Object {$_ -like '*span class="box">*'}
>> span class="box">ü
ConvertFrom-Html parses that into "??":
$Html = ConvertFrom-Html -Content $Result
$Html.SelectNodes('//span[@class="box"]')
>> NodeType Name AttributeCount ChildNodeCount ContentLength InnerText
>> -------- ---- -------------- -------------- ------------- ---------
>> Element  span 1              1              2             ??
The response headers show the correct Content-Type, with charset=utf-8:
$Result.Headers
>> Key Value
>> --- -----
>> Server {nginx}
>> Date {Sun, 03 Oct 2021 10:22:56 GMT}
>> Connection {keep-alive}
>> X-Powered-By {Express}
>> Accept-Ranges {bytes}
>> Cache-Control {public, max-age=0}
>> ETag {W/"aabd-17a2d88a25f"}
>> X-Response-Time {0}
>> Vary {Accept-Encoding}
>> Content-Type {text/html; charset=utf-8}
>> Content-Length {43709}
>> Last-Modified {Mon, 21 Jun 2021 07:46:07 GMT}
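The two question marks and the ContentLength of 2 would be consistent with the two UTF-8 bytes of "ü" (0xC3 0xBC) being decoded with a single-byte encoding such as ASCII somewhere along the way. This is only an assumption and has not been confirmed against the PowerHTML source. The sketch below reproduces the symptom with plain .NET calls; the variable names are made up for illustration.

```powershell
# Hypothetical illustration, not taken from the PowerHTML source.
# Build "ü" from its code point so the script does not depend on file encoding.
$char = [string][char]0x00FC

# UTF-8 encodes U+00FC as two bytes: 0xC3 0xBC.
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($char)

# ASCII cannot represent bytes above 0x7F, so each byte is replaced by '?'.
$wrong = [System.Text.Encoding]::ASCII.GetString($utf8Bytes)

# Decoding the same bytes as UTF-8 round-trips the original character.
$right = [System.Text.Encoding]::UTF8.GetString($utf8Bytes)

'{0} bytes, ASCII: "{1}", UTF-8: "{2}"' -f $utf8Bytes.Length, $wrong, $right
# → 2 bytes, ASCII: "??", UTF-8: "ü"
```

If this is what happens, the fix would presumably be to decode the content as UTF-8 (or to honor the charset from the Content-Type header) before handing it to the parser.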
Version info
$PSVersionTable
>> Name Value
>> ---- -----
>> PSVersion 7.1.3
>> PSEdition Core
>> GitCommitId 7.1.3
>> OS Darwin 20.2.0 Darwin Kernel Version 20.2.0: Wed Dec 2 20:40:21 PST 2020; root:xnu-7195.60.75~1/RELEASE_ARM64_T8101
>> Platform Unix
>> PSCompatibleVersions {1.0, 2.0, 3.0, 4.0…}
>> PSRemotingProtocolVersion 2.3
>> SerializationVersion 1.1.0.1
>> WSManStackVersion 3.0
Get-Module PowerHTML
>> ModuleType Version PreRelease Name      ExportedCommands
>> ---------- ------- ---------- ----      ----------------
>> Script     0.1.7              PowerHTML ConvertFrom-Html
desk7