[help] Straight forward standardized shorthand to decode HTML entities #3560
-
Hi @forthrin, Nokogiri's HTML5 parser does have a function designed to identify HTML entities (or character references, as the HTML standard calls them) and return what they map to, but on its own that's not sufficient to get what you want. Some of the tricky bits of doing this that come to mind
Nokogiri's HTML5 parser deals with this by implementing HTML tokenization, which handles all of that. Adding what you're asking for here would amount to reimplementing a portion of the tokenizer specifically to handle entities. As an aside, the htmlentities gem doesn't handle all of the HTML5 entities (nor is it attempting to). E.g.,
-
This came up again, so I ran a test on all entities found in a cache of HTML files, and all except Possible actions:
@stevecheckoway: Awaiting your comments.
-
Hi @forthrin, I'm not sure I understand exactly what you're asking me to do. Writing a wrapper function to call Nokogiri appropriately and extract the result seems like a simple 2-line function. I'll defer to Mike on whether that's something he wants to include in Nokogiri or not, but my inclination would be not to. (It could be done in another gem as well, but that seems like a left-pad situation, so probably not a great idea either.) If the problem is that Nokogiri builds a full DOM, then, well, yeah. That's what it does. If you wanted a function that took a single HTML entity and decoded it, I could see exposing that functionality from Nokogiri (that would be the internal function You want to take a restricted set of HTML and then decode it to text. But I don't know what that restricted set is or what behavior your intended decoding function should have in the presence of other HTML markup. E.g., what would you like
-
The problem with this approach is that it will consume leading whitespace. You have to replace
-
OK, I suddenly realize the need to go into the relevant use cases, because normally you get this for free with a regular HTML document, so when do you need to decode explicitly apart from that?
And these are almost always the same few culprit entities, thus the request.
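To make the "same few culprits" idea concrete, here is a minimal sketch of the kind of per-project refinement being discussed. Everything in it is an assumption for illustration: the module name `EntityShorthand`, the method `decode_entities`, and the entity table are hypothetical and not part of Nokogiri. It is also deliberately not the HTML5 tokenizer: it requires the trailing semicolon, leaves unknown names untouched, and ignores the spec's special cases for numeric references.

```ruby
# Hypothetical helper, not Nokogiri's API: decodes a small hand-picked set of
# named entities plus numeric character references, using only core Ruby.
module EntityShorthand
  # Assumed list of "usual culprits"; adjust it to whatever your data contains.
  NAMED = {
    "amp" => "&", "lt" => "<", "gt" => ">", "quot" => '"', "apos" => "'",
    "nbsp" => "\u00A0", "ndash" => "\u2013", "mdash" => "\u2014",
    "hellip" => "\u2026", "euro" => "\u20AC", "pound" => "\u00A3"
  }.freeze

  # Matches &#123; or &#x1F; or &name; (semicolon required, unlike real HTML).
  PATTERN = /&(?:#(\d+)|#[xX](\h+)|([A-Za-z][A-Za-z0-9]*));/

  # Single-pass decode, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  def self.decode(text)
    text.gsub(PATTERN) do |match|
      dec, hex, name = $1, $2, $3
      if name
        NAMED.fetch(name, match) # names outside the table are left as-is
      else
        codepoint(dec ? dec.to_i : hex.to_i(16)) || match
      end
    end
  end

  # Returns the character for a code point, or nil if Ruby rejects it
  # (surrogates, out-of-range values). The HTML spec maps these differently.
  def self.codepoint(code)
    code.chr(Encoding::UTF_8)
  rescue RangeError
    nil
  end

  # Opt-in String shorthand, active only in files that call `using`.
  refine String do
    def decode_entities
      EntityShorthand.decode(self)
    end
  end
end
```

After `using EntityShorthand`, a call such as `"5&nbsp;&euro;".decode_entities` returns the digit followed by a no-break space and a euro sign, and leading whitespace is preserved. Anything outside the table, such as `&eacute;`, passes through unchanged, which is exactly the gap a full parser closes.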
-
The restricted set is the one I outlined in the OP, i.e. quarterblocks 0080 and 2000 plus the couple of currencies. Latin 1 would also be a good bonus because of its legacy prevalence. Whenever I've encountered broken HTML, I've usually just gotten rid of all of it with a quick gsub, because it has never had any value in itself, A single
-
Both Nokogiri and Nokolexbor already support all entities known to man:
So this seems to come down to two things.
-
Please show your code!
What would be an ideal official convenience shorthand in Nokogiri, to save everyone from writing their own refinement?
Environment
ruby 2025-08-01 30a20bc16
rubocop 2025-09-06 402f0cfcf
System Version: macOS 15.6 (24G84)
Model Identifier: MacBookAir10,1