[help] Straight forward standardized shorthand to decode HTML entities #3560

forthrin · 2025-09-16T06:32:58Z

forthrin
Sep 16, 2025

Please show your code!

CGI.unescapeHTML('R&oslash;de')
"R&oslash;de" # Fails

HTMLEntities.new.decode('R&oslash;de')
"Røde" # Works, but requires an extra library not used otherwise

Nokogiri.HTML5('R&oslash;de').text
"Røde" # Works, but must go via full DOM document

Nokolexbor.HTML('R&oslash;de').at('html').text
"Røde" # Works, but must go via full DOM document (and look up html/body tag, a difference from Nokogiri)

What would be an ideal official convenience shorthand in Nokogiri, save everyone making their own refinement?

Environment

ruby 2025-08-01 30a20bc16
rubocop 2025-09-06 402f0cfcf
System Version: macOS 15.6 (24G84)
Model Identifier: MacBookAir10,1

stevecheckoway · 2025-09-24T14:36:41Z

stevecheckoway
Sep 24, 2025
Maintainer

Hi @forthrin,

Nokogiri's HTML5 parser does have a function designed to identify HTML entities (or character references, as the HTML standard calls them) and return what they map to, but on its own that's not sufficient to get what you want. Some of the tricky bits of doing this that come to mind:

Entities are handled differently depending on if they're in an attribute or not;
Entities can be erroneous and not expand into anything other than themselves (e.g., &#z), while others can be erroneous and still expand to the expected text (e.g., &#x80 (missing a semicolon) expands to U+20AC (€));
Entities can be syntactically correct but still lead to errors like

Nokogiri's HTML5 parser deals with this by implementing the HTML tokenization which handles all of that. Adding what you're asking here would amount to reimplementing a portion of the tokenizer specifically to handle entities.
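For the numeric cases, the step that turns &#x80 into U+20AC is the standard's "numeric character reference end state". What follows is only a rough sketch of that one step in plain Ruby, not Nokogiri's implementation. The names WINDOWS_1252_REMAP, resolve_numeric_reference and decode_numeric_references are made up for illustration, and named entities and the attribute-specific rules are deliberately left out.

```ruby
# Code points 0x80..0x9F that the HTML standard remaps (the Windows-1252 table).
WINDOWS_1252_REMAP = {
  0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
  0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
  0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
  0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
  0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
  0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
  0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178
}.freeze

# Hypothetical helper: map the number written in a reference to the code
# point that should be emitted.
def resolve_numeric_reference(code)
  # NUL, out-of-range values and surrogates become U+FFFD.
  return 0xFFFD if code.zero? || code > 0x10FFFF || (0xD800..0xDFFF).cover?(code)

  WINDOWS_1252_REMAP.fetch(code, code)
end

# Hypothetical helper: decode decimal and hex references. The trailing
# semicolon is optional; a reference with no digits (e.g. "&#z") does not
# match the pattern and is left untouched.
def decode_numeric_references(text)
  text.gsub(/&#(?:[xX](\h+)|(\d+));?/) do
    code = $1 ? $1.to_i(16) : $2.to_i
    [resolve_numeric_reference(code)].pack('U')
  end
end

puts decode_numeric_references('&#x80 and &#z')
```

Even this small piece needs a lookup table and several special cases, which is the point: the full behaviour lives in the tokenizer.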

As an aside, the htmlentities gem doesn't handle all of the HTML5 entities (nor is it attempting to). E.g., &bne; should expand to U+0003D U+020E5 (=⃥).

0 replies

forthrin · 2026-01-05T07:11:20Z

forthrin
Jan 5, 2026
Author

This came up again, so I ran a test on all entities found in a cache of HTML files, and all except CGI do well on typical specimens.

Possible actions:

CGI: Support a few more of the most frequent entities
Nokos: Don't convert misnomers such as &degree; (?)
Nokolexbor: Don't require DOM for conversion

@stevecheckoway: Awaiting your comments.

Freq	Entity	Point	CGI	HTMLEntities	Nokogiri	Nokolexbor
648019	&	U+0026	&	&	&	&
417737	"	U+0022	"	"	"	"
13070	<	U+003C	<	<	<	<
13061	>	U+003E	>	>	>	>
1030	'	U+0027	'	'	'	'
176311		U+00A0		`U+00A0`	`U+00A0`	`U+00A0`
20192	©	U+00A9	©	©	©	©
143	»	U+00BB	»	»	»	»
132	«	U+00AB	«	«	«	«
54	°	U+00B0	°	°	°	°
3	®	U+00AE	®	®	®	®
4336	å	U+00E5	å	å	å	å
3659	ø	U+00F8	ø	ø	ø	ø
173	æ	U+00E6	æ	æ	æ	æ
67	é	U+00E9	é	é	é	é
5144	’	U+2019	’	’	’	’
777	–	U+2013	–	–	–	–
743	•	U+2022	•	•	•	•
222	…	U+2026	…	…	…	…
113	&rsaquo;	U+203A	&rsaquo;	›	›	›
113	&lsaquo;	U+2039	&lsaquo;	‹	‹	‹
37	—	U+2014	—	—	—	—
14	&dagger;	U+2020	&dagger;	†	†	†
1	&permil;	U+2030	&permil;	‰	‰	‰
192	€	U+20AC	€	€	€	€
3	™	U+2122	™	™	™	™
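To make the "support a few more of the most frequent entities" idea concrete, here is a minimal plain-Ruby sketch that covers only the code points in the table above. It is not how CGI, htmlentities or Nokogiri work. COMMON_ENTITIES, ENTITY_PATTERN and decode_common_entities are hypothetical names, and the entity names are an assumption: they are the standard names for the listed code points. Unlike the HTML standard, this sketch requires the trailing semicolon and leaves anything it does not recognize untouched.

```ruby
# Assumed names for the code points in the frequency table.
COMMON_ENTITIES = {
  'amp' => '&', 'quot' => '"', 'lt' => '<', 'gt' => '>', 'apos' => "'",
  'nbsp' => "\u00A0", 'copy' => "\u00A9", 'raquo' => "\u00BB",
  'laquo' => "\u00AB", 'deg' => "\u00B0", 'reg' => "\u00AE",
  'aring' => "\u00E5", 'oslash' => "\u00F8", 'aelig' => "\u00E6",
  'eacute' => "\u00E9", 'rsquo' => "\u2019", 'ndash' => "\u2013",
  'bull' => "\u2022", 'hellip' => "\u2026", 'rsaquo' => "\u203A",
  'lsaquo' => "\u2039", 'mdash' => "\u2014", 'dagger' => "\u2020",
  'permil' => "\u2030", 'euro' => "\u20AC", 'trade' => "\u2122"
}.freeze

# Decimal reference, hex reference, or named reference, each ending in ";".
ENTITY_PATTERN = /&(?:#(\d+)|#[xX](\h+)|([A-Za-z][A-Za-z0-9]*));/

# Hypothetical helper: decode the references above in a single pass, so
# that "&amp;lt;" becomes "&lt;" and is not decoded a second time.
def decode_common_entities(text)
  text.gsub(ENTITY_PATTERN) do |match|
    dec, hex, name = $1, $2, $3
    if name
      # Unknown names such as "&degree;" are returned unchanged.
      COMMON_ENTITIES.fetch(name, match)
    else
      code = dec ? dec.to_i : hex.to_i(16)
      valid = code.positive? && code <= 0x10FFFF && !(0xD800..0xDFFF).cover?(code)
      valid ? [code].pack('U') : match
    end
  end
end

puts decode_common_entities('R&oslash;de &copy; 2026')
```

A fixed table like this trades completeness for having no dependency, which is the same trade-off the CGI row of the table reflects.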

0 replies

stevecheckoway · 2026-01-05T17:38:01Z

stevecheckoway
Jan 5, 2026
Maintainer

Hi @forthrin,

I'm not sure I understand exactly what you're asking me to do. Writing a wrapper function to call Nokogiri appropriately and extract the result seems like a simple 2-line function. I'll defer to Mike on whether that's something he wants to include in Nokogiri or not but my inclination would be not to. (It could be done in another gem as well, but that seems like a left-pad situation so probably not a great idea either.)

If the problem is Nokogiri builds a full DOM, then well, yeah. That's what it does.

If you wanted a function that took a single HTML entity and decoded it, I could see exposing that functionality from Nokogiri (that would be the internal function match_named_char_ref()) but I don't think that's what you want based on your examples.

You want to take a restricted set of HTML and then decode it to text. But I don't know what that restricted set is or what behavior your intended decoding function should have in the presence of other HTML markup. E.g., what would you like <span>&</span> to return? What about broken markup? What about 0-bytes? What about other errors?

0 replies

nwellnhof · 2026-01-21T19:49:52Z

nwellnhof
Jan 21, 2026

Nokogiri.HTML5('R&oslash;de').text

The problem with this approach is that it will consume leading whitespace. You have to replace < with &lt; and prepend a <body> tag. Then it should be parsed as a single text node as you would expect.

1 reply

stevecheckoway Jan 21, 2026
Maintainer

Nokogiri.HTML5('R&oslash;de').text
The problem with this approach is that it will consume leading whitespace. You have to replace < with &lt; and prepend a <body> tag. Then it should be parsed as a single text node as you would expect.

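require 'nokogiri'

# Parse the input as a fragment in the context of a <textarea> element, so
# character references are decoded and anything tag-like is kept as text.
def decode(input)
  Nokogiri::HTML5::DocumentFragment.new(Nokogiri::HTML5::Document.new,
                                        input,
                                        context: 'textarea').text
end

# Decode each line read from standard input.
while (line = gets)
  puts(decode(line.chomp))
end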

This essentially 1-line function seems to work for me. Caching and reusing the Document might be worthwhile, but I'd definitely test that.

forthrin · 2026-01-21T20:44:22Z

forthrin
Jan 21, 2026
Author

OK, I suddenly realize the need to go into the relevant use cases, because normally you get this for free with a regular HTML document, so when do you need to decode explicitly apart from that?

Podcast descriptions with HTML and entities
HTML so last century that all efforts of using CSS paths crash and burn, reverting to RegExp and ultimately unescaping.
Sites who think it's a good idea to pass HTML snippets in XHR JSON values rather than pure data.

And these are usually always the same few culprit entities, thus the request.

1 reply

stevecheckoway Jan 21, 2026
Maintainer

OK, I suddenly realize the need to go into the relevant use cases, because normally you get this for free with a regular HTML document, so when do you need to decode explicitly apart from that?
* Podcast descriptions with HTML and entities

* HTML so last century that all efforts of using CSS paths crash and burn, reverting to RegExp and ultimately unescaping.

* Sites who think it's a good idea to pass HTML snippets in XHR JSON values rather than pure data.
And these are usually always the same few culprit entities, thus the request.

Hi @forthrin,

You didn't respond to these questions which makes it difficult to help:

You want to take a restricted set of HTML and then decode it to text. But I don't know what that restricted set is or what behavior your intended decoding function should have in the presence of other HTML markup. E.g., what would you like <span>&</span> to return? What about broken markup? What about 0-bytes? What about other errors?

forthrin · 2026-01-21T21:10:57Z

forthrin
Jan 21, 2026
Author

The restricted set I've outlined in the OP, i.e. quarterblocks 0080, 2000 plus the couple of currencies. Latin 1 would also be a good bonus because of its legacy prevalence.

Whenever I've encountered broken HTML, I've usually just gotten rid of all of it with a quick gsub, because it has never had any value in itself, gsub(%r{</?.*?>}, '').
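As a quick illustration of what that gsub does, here is a small sketch in plain Ruby. TAG_PATTERN and strip_tags are names invented for this example. It removes everything from a "<" up to the next ">" on the same line, which works for simple tags but also swallows text when a bare "<" and ">" appear in prose.

```ruby
# The pattern from the gsub call above: "<", an optional "/", then the
# shortest run of characters up to the next ">" (not crossing a newline).
TAG_PATTERN = %r{</?.*?>}

# Hypothetical helper wrapping the gsub call.
def strip_tags(text)
  text.gsub(TAG_PATTERN, '')
end

puts strip_tags('<blink>R&oslash;de</blink>') # R&oslash;de

# Caveat: a bare "<" followed later by ">" is treated as one tag, so the
# text between them disappears as well.
puts strip_tags('1 < 2 and 3 > 2')
```

Entities are left untouched by this step, so decoding still has to happen afterwards.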

A single & would remain a single &. 0-bytes I've never encountered in the wild either. I suppose they would remain as-is. Any particular reason you bring these up, except covering your bases for every possible scenario?

1 reply

stevecheckoway Jan 21, 2026
Maintainer

So if I understand correctly, you're happy removing all other HTML markup leaving just plain text and entities.

I mentioned the broken HTML and 0 bytes because the standard gives meaning to nearly every sequence of bytes, it's just that many of these result in errors.

Given that, I'd recommend the decode function I gave above, potentially with your gsub call to remove all start and end tags.

If that's too slow for your use case, then I think the options are

1. Improve the parsing performance of the gumbo library; or
2. Implement a stand-alone portion of the tokenizer.

For approach 1, I'd suggest looking at all of the places where adding a character just appends it to a string buffer and doesn't change state, and handling those in a loop rather than going through the state dispatch mechanism (which is surprisingly complicated).

For approach 2, I'd recommend implementing the RCData state other than the handling for <.

Note that this approach is what my decode function is approximating. Specifically, it's parsing a fragment in the context of a textarea HTML element. The rules for that say to start the tokenizer in the RCData state and it'll stay in that state or the character reference states until the end of the input. The title element behaves similarly and could be used instead of textarea.

forthrin · 2026-01-22T06:41:54Z

forthrin
Jan 22, 2026
Author

Both Nokogiri and Nokolexbor already support all entities known to man:

curl -s https://html.spec.whatwg.org/entities.json | jq -r '. | keys[]' | tr -d '\n' | ruby -r nokogiri -e 'puts Nokogiri.HTML5(STDIN).text'

So this seems to come down to two things.

Performance: How much overhead do Nokogiri.HTML5().text and Nokolexbor.HTML().at('html').text add?
Terseness: What options are there for convenience functions? Are refinements out of scope (OoS) for libraries and left to each dev?

require 'nokogiri'

module Nice
  refine String do
    def dehtml
      Nokogiri.HTML5(self).text # gsub(%r{</?.*?>}, '')
    end
  end
end

using Nice # activate the refinement for the rest of this file

puts '<blink>R&oslash;de</blink>'.dehtml # Røde

0 replies
